MASSACHUSETTS INSTITUTE OF TECHNOLOGY 
ARTIFICIAL INTELLIGENCE LABORATORY



TIME-FREQUENCY REPRESENTATIONS
FOR SPEECH SIGNALS 



Michael D. Riley 



This report was also submitted as a thesis to the Department of Electrical
Engineering and Computer Science at the Massachusetts Institute of Technology on
May 22, 1987, in partial fulfillment of the requirements for the degree of Doctor of
Philosophy.



Acknowledgements: This report describes research done at the Artificial
Intelligence Laboratory of the Massachusetts Institute of Technology. Support for the
author was provided in part by AT&T Bell Laboratories.



Massachusetts Institute of Technology 1987



TIME-FREQUENCY REPRESENTATIONS 

FOR SPEECH SIGNALS 

by 

Michael D. Riley 

Submitted to the Department of Electrical Engineering and Computer Science 

on May 22, 1987 in partial fulfillment of the requirements for the

Degree of Doctor of Philosophy in Computer Science 

Abstract 

This work addresses two related questions. The first question is what joint
time-frequency energy representations are most appropriate for auditory signals, in
particular, for speech signals in sonorant regions. The quadratic transforms of the
signal are examined, a large class that includes, for example, the spectrograms and
the Wigner distribution. Quasi-stationarity is not assumed, since this would neglect
dynamic regions. A set of desired properties is proposed for the representation: (1)
shift-invariance, (2) positivity, (3) superposition, (4) locality, and (5) smoothness.
Several relations among these properties are proved: shift-invariance and positivity
imply the transform is a superposition of spectrograms; positivity and superposition
are equivalent conditions when the transform is real; positivity limits the
simultaneous time and frequency resolution (locality) possible for the transform,
defining an uncertainty relation for joint time-frequency energy representations; and
locality and smoothness trade off by the 2-D generalization of the classical
uncertainty relation. The transform that best meets these criteria is derived, which
consists of two-dimensionally smoothed Wigner distributions with (possibly oriented)
2-D gaussian kernels. These transforms are then related to time-frequency filtering, a
method for estimating the time-varying 'transfer function' of the vocal tract, which
is somewhat analogous to cepstral filtering generalized to the time-varying case.
Natural speech examples are provided.

The second question addressed is how to obtain a rich, symbolic description of the
phonetically relevant features in these time-frequency energy surfaces, the so-called
schematic spectrogram. Time-frequency ridges, the 2-D analog of spectral peaks,
are one feature that is proposed. If non-oriented kernels are used for the energy
representation, then the ridge tops can be identified with zero-crossings in the inner
product of the gradient vector and the direction of greatest downward curvature.
If oriented kernels are used, the method can be generalized to give better
orientation selectivity (e.g., at intersecting ridges) at the cost of poorer
time-frequency locality. Many speech examples are given showing the performance for
some traditionally difficult cases: semi-vowels and glides, nasalized vowels,
consonant-vowel transitions, female speech, and imperfect transmission channels.

Thesis Supervisor: Prof. Thomas Knight 

Title: Assistant Professor of Electrical Engineering and Computer Science



Acknowledgments 

I wish to thank Tom Knight for supervising this thesis. I also thank my readers
Tomaso Poggio, Victor Zue, and especially Mark Liberman. I am grateful to
Patrick Winston for his support at the M.I.T. Artificial Intelligence Laboratory
and Osamu Fujimura, Max Matthews and James Flanagan for their support at Bell
Laboratories.



Table of Contents 

Chapter 1. Introduction

Chapter 2. The time-frequency energy representation
2.1. The stationary case
2.2. The quasi-stationary case
2.3. Non-stationarity
2.4. Joint time-frequency representations
2.5. Design criteria for time-frequency representations
2.6. Relations among the design criteria
2.7. Satisfying the design criteria
2.8. Directional time-frequency transforms
2.9. A speech example

Chapter 3. Time-frequency filtering
3.1. The stationary case
3.2. Non-stationary vocal tract
3.3. Time-frequency filtering
3.4. The stationary case, re-examined
3.5. Linearly varying modulation frequency
3.6. The quasi-stationary case
3.7. Smoothly varying modulation frequency
3.8. The vocal tract transfer function
3.9. The transmission channel
3.10. The excitation

Chapter 4. The Schematic Spectrogram
4.1. Rationale
4.2. Spectral Peaks
4.3. Time-frequency ridges: non-directional kernel
4.4. Time-frequency ridges: directional kernel
4.5. Signal detection and ridge identification
4.6. Continuity and grouping
4.7. A perspective

Chapter 5. A catalog of examples
5.1. Some general examples
5.2. Semi-vowels and glides
5.3. Nasalized vowels
5.4. Consonant-vowel transitions
5.5. Female speech
5.6. Transmission channel effects

References



List of Figures 

Chapter 1. Introduction
1.1. Steps in the initial auditory processing.

Chapter 2. The time-frequency energy representation
2.1. Short-time spectrum of a steady-state /i/.
2.2. Smoothed short-time spectra.
2.3. Short-time spectra of linear chirps.
2.4. Short-time spectra of /w/'s.
2.5. Wideband spectrograms of /w/'s.
2.6. Spectrograms of rapid formant motion in various contexts.
2.7. Wigner distribution and spectrogram.
2.8. Wigner distribution and spectrogram of [illegible].
2.9. Concentration ellipses for transform kernels.
2.10. Concentration ellipses for complementary kernels.
2.11. Directional transforms for a linear chirp.
2.12. Spectrograms of /wioi/ with different window sizes.
2.13. Wigner distribution of /wioi/.
2.14. Time-frequency autocorrelation function of /wioi/.
2.15. Gaussian transform of /wioi/.
2.16. Directional transforms of /wioi/.

Chapter 3. Time-frequency filtering
3.1. Recovering the transfer function by filtering.
3.2. Estimating 'aliased' transfer function.
3.3. T-F autocorrelation function of an impulse train.
3.4. T-F autocorrelation function of LTI filter output.
3.5. Windowing recovers transfer function.
3.6. Shearing the time-frequency autocorrelation function.
3.7. T-F autocorrelation function for FM filter.
3.8. T-F autocorrelation function of FM filter output.
3.9. Windowing recovers transfer function.

Chapter 4. The Schematic Spectrogram
4.1. Problems with pole-fitting approach.
4.2. Peaks in spectral cross-sections of the energy surface.
4.3. Gradient and curvature vectors near rising F2.
4.4. 2-D ridge computation applied to the energy surface.
4.5. Two conditions for ridge detection.
4.6. Tuning curves for gaussian transform kernels.
4.7. Transform kernel [formula illegible].
4.8. Tuning curves for kernels of the form in Figure 4.7.
4.9. Ridge tops computed with directional transform.
4.10. Hysteresis thresholding.
4.11. Two contours competing for labelling as F2.
4.12. Turning an /I/ into an /u/ by filtering.
4.13. Spectrogram of /wi/.
4.14. Merging formants.
4.15. Contour junctions located by simple proximity rules.

Chapter 5. A catalog of examples
5.1. "May we all learn a yellow lion roar."
5.2. "Are we winning yet?"
5.3. "We were away a year ago."
5.4. "Why am I eager?"
5.5. /w/'s at various speech rates.
5.6. Syllable initial /ju/'s at various speech rates.
5.7. Syllable initial /l/'s.
5.8. /r/'s in various vowel contexts.
5.9. Nasalized vowels.
5.10. Syllable initial /b/'s.
5.11. Syllable initial /d/'s.
5.12. Syllable initial /g/'s.
5.13. Rapid formant transitions.
5.14. /uluiui/ uttered by adult female.
5.15. Transmission channels.
5.16. Broadband channel's effects.
5.17. Narrowband channel's effects.
5.18. Stopband channel's effects.



Chapter 1. 
Introduction 



In order to perceive speech and other sounds, the incoming sound wave must be
transformed into a variety of representations, each bringing forth different aspects
of the signal, its source, and meaning. Understanding how we perceive and how
machines can be made to perceive auditory signals means, in part, discovering
appropriate representations for the signals and how to compute them. For many
kinds of sounds, little is known in this respect. What auditory features, for example,
will distinguish a knock at the door from a footstep?

For speech signals, more is thought to be known. A phonetician will tell you, for
example, that the /æ/ in had can be distinguished from the /ɛ/ in head by the
location of characteristic peaks in their respective spectra. He could even train you
to identify a wide variety of phonetic elements by looking at their spectrograms.
Formalizing this knowledge, however, so that a computer can do this well (in a
general setting) has proved hard.

An analogy may explain why. I could train you to distinguish a Mercedes from some
other car easily; I would just describe the hood ornament.† To train a machine
to do this task would be much harder. Not only would I have to describe the hood
ornament, but I would also have to provide all the visual abilities that I take for
granted with a human: finding edges and boundaries, recognizing closed forms,
etc. I believe the failure to correctly provide the corresponding auditory abilities
(finding spectral 'peaks' and temporal discontinuities, recognizing continuous
forms, etc.) is an important reason why the speech recognition problem has been
so difficult.

† I thank Mark Liberman for this example.

This problem is in some ways even harder than visual analysis. In vision, it is clear
that the two-dimensional image is a natural starting point. In audition, a similar 2D
representation is important, with time along one axis and frequency along the other.
But how should this idea be made precise (the well-known uncertainty principle of
Fourier analysis is one of the thorny issues involved)? Should we use the conventional
spectrogram, the Wigner distribution, a pseudo-auditory spectrogram, or something
entirely new, and how should this decision be made?

In vision, edges, lines, and so forth are obviously important features
of an image. In audition, it is harder to decide what are the appropriate primitive
elements. Can some symbolic description summarize the relevant features of a
sound's time-frequency representation analogous to how a line drawing summarizes
an image?

These questions about the early steps in auditory processing are the topic of this
thesis. The emphasis will be on speech signals, primarily because the intermediate
goals toward which the initial computations must aim are better understood. I believe,
nevertheless, that many of the auditory processing issues discussed here are also
relevant for other kinds of sounds.




The topic as stated is still too broad. Speech and other signals are made up of many
different kinds of components. For instance, speech has fairly smoothly changing
vocalic regions that are quite different from the more discontinuous structure of
consonantal regions. It is unlikely that the same initial representations will be
appropriate for every kind of signal. The emphasis here will be on signals like those
found in the more continuous, sonorant regions of speech.

In the sonorant regions, an apparent feature is the local spectral energy
concentrations that vary in center frequency with time. These peaks are due, in part,
to the "resonances" of the vocal tract, the so-called formants. The formant
locations (labelled F1, F2, ... in order of increasing frequency) specify the general
vowel quality, r-coloring and roundness, while the formant transitions between
consonants and vowels play an important role in consonant identification [see e.g.
Chiba & Kajiyama 1941; Fant 1960; Liberman, et al 1954; Ladefoged 1975]. A. Liberman,
in fact, claims that "...the second formant transition...is probably the single most
important carrier of linguistic information in the speech signal" [Liberman, et al
1967]. Thus, restricting the discussion to these regions is by no means uninteresting.

The initial speech processing envisioned here has been divided into two steps. The
first step, which produces a joint time-frequency representation of the signal energy,
is explored in Chapter 2 and Chapter 3. The second step, which produces a symbolic
representation that captures the acoustically relevant features present in the joint
time-frequency energy representation, is explored in Chapter 4 (see Figure 1.1).

One of the most difficult problems in deriving the form of such representations is
deciding which properties or axioms to assume at the outset. If strong assumptions
are made about the received signal, then rigorously defined optimal detection can
result. For example, if we assume that the received signal consists solely of a
known signal in additive Gaussian noise, then we could build a matched filter that
performs optimal Bayesian detection [e.g., see Van Trees 1968]. The disadvantage
of such strong assumptions is that they are seldom universally valid for natural
perceptual signals.

[Figure 1.1 diagram: Speech -> (a) Time-Frequency Energy Representation ->
(b) Schematic Spectrogram (spectral peaks; time discontinuities; spectral balance
information) -> (c) Acoustic Representation (excitation: pitch; vocal tract:
formants; transmission channel).]

Figure 1.1. The initial speech processing is seen as divided into two steps. (a) The
first step represents the signal energy as joint functions of time and frequency. (b)
The second step builds a symbolic representation of the significant features present
in the joint time-frequency energy representations. At this step, which we call the
schematic spectrogram, there is no undue commitment to the acoustic origin
of the features represented; it is a description of the signal, not its sources. (c)
In subsequent processing, these initial descriptions can be used to decompose the
signal into its acoustic sources.

On the other hand, weaker assumptions made about the received signal can be
combined with assumptions about the design of the representation, things like
linearity, continuity, locality, and stability, that can result in a solution [cf. Marr
& Nishihara]. These design criteria are chosen not on the basis of a specific signal
model, but instead as reasonable choices that should be appropriate for a wide range of
natural signals. The disadvantage of this approach is that the justification of the
design decisions is more intuitive and abstract.

In the best of circumstances, the two approaches would result in the same or similar
solutions to a problem. Thus the auditory processing would perform optimally (in
different senses) when both appropriate weak and strong assumptions are made
about the received signal.

Chapter 2 derives those joint time-frequency energy representations that satisfy
a small set of desirable properties; these properties are intentionally kept quite
general. Chapter 3 re-examines this problem in a more specific setting: given a
(time-varying) model of speech production, what time-frequency representation of
the signal best depicts the 'transfer function' of the vocal tract while suppressing
the excitation? These two approaches, in fact, yield similar solutions.



In the initial part of Chapter 4, a general, heuristic argument is used to produce a
phonetically relevant, symbolic representation of the signal. In a later part, these
solutions are briefly related to a signal detection model.

In Chapter 5, we look at a wide range of examples using these proposed methods.
We examine some traditionally difficult speech cases: glides and semi-vowels,
nasalized vowels, consonant-vowel transitions, female speech, and imperfect
transmission channels.

N.B. For the figures in this thesis, time is in seconds, frequency in
Hertz, and energy in decibels, unless otherwise indicated.



Chapter 2. 

The Time-Frequency
Energy Representation 



This chapter explores the design of joint time-frequency energy representations for
speech signals. A set of desirable properties for such representations to satisfy is
proposed, and the relationships among these properties are discussed. This includes
a general treatment of the 'uncertainty' relations that arise. The signal transforms
that best satisfy these properties are then derived and examined.

2.1. The stationary case

We begin with an analysis of the special case of stationary signals. There is a large
literature for this case; Rabiner & Schafer [1978] and Flanagan [1972] provide good
reviews. The discussion of it here is very condensed and confined to topics that are
relevant to the sequel.

A stationary signal is used here to mean, roughly, a signal whose frequency content
does not vary with time. More precisely, we consider only deterministic signals that
are periodic and random signals that are correlation-stationary. For both kinds
of signals, the power spectrum, the Fourier transform of the autocorrelation
function, captures naturally the energy present at each frequency.† Time is removed
from this representation; the power spectrum is a one-dimensional representation
of energy as a function of frequency.
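The relation between the autocorrelation function and the power spectrum can be checked numerically for a periodic signal. The following is a minimal sketch, assuming a discrete, one-period test signal of our own choosing (the harmonic numbers 5 and 12 and the length `N` are illustrative, not taken from the thesis). It computes the power spectrum once as the Fourier transform of the autocorrelation and once directly from the DFT of the signal.

```python
import numpy as np

# Hypothetical periodic test signal (one period, N samples); not from the thesis.
N = 256
n = np.arange(N)
x = np.cos(2 * np.pi * 5 * n / N) + 0.5 * np.sin(2 * np.pi * 12 * n / N)

# Autocorrelation over one period: r[k] = mean over n of x[n + k] * conj(x[n]).
r = np.array([np.mean(np.roll(x, -k) * np.conj(x)) for k in range(N)])

# Power spectrum computed two ways.
power_from_autocorr = np.fft.fft(r).real          # Fourier transform of r
power_direct = np.abs(np.fft.fft(x)) ** 2 / N     # squared DFT magnitude of x

# Frequency bins that carry energy (positive and negative frequencies).
peak_bins = np.flatnonzero(power_direct > 1e-6 * power_direct.max())
```

Time does not appear in either result: the energy is concentrated at the harmonic bins regardless of where the period is taken to start.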

For speech signals there are, of course, no completely stationary signals. We can,
however, deliberately utter vowels so that they are steady-state for as long as we
like. Figure 2.1 shows the spectrum of a long duration, voiced /i/. We find in the
spectrum many of the characteristic features of a steady-state vowel.

Let us examine the spectrum in Figure 2.1. Note the y-axis is logarithmic to
compress the wide dynamic range of the speech. At a fine scale in this spectrum, there
are peaks spaced about every hundred Hertz; these are the harmonics of the pitch.
The somewhat larger scale peaks, of a few hundred Hz bandwidth, are the formant
peaks. The peak at about 300 Hz is F1 and the peak at about 2300 Hz is F2, which
is characteristic of an /i/ vowel for an adult male. Still larger scale shaping of the
spectrum, so-called spectral balance, is due to the formant locations, the nature of
the voicing and the transmission channel.

The spectral structure of a vowel, therefore, is due acoustically to several factors:
(1) the vocal excitation (e.g., voiced); (2) the vocal tract transfer function,
characterized by its resonant frequencies, the formants; and (3) the transmission
characteristics (e.g., room acoustics). Determining these factors from the speech
(i.e., finding the formant frequencies, the pitch, etc.) is an important intermediate
step in speech analysis, since they decompose the signal into components of nearly
independent origin, and are thus starting points for the phonetician's description
of the speech signal.



† For a deterministic signal $x(t)$, its autocorrelation function is
$\int x(t+\tau)\,x^*(t)\,dt$; for a correlation-stationary random process $y(t)$, its
autocorrelation function is $E[\,y(t+\tau)\,y^*(t)\,]$.



Figure 2.1. Short-time log spectrum of a steady-state /i/. The finest scale
structure corresponds to the harmonics of the pitch, spaced about every 100 Hz. At an
intermediate scale are the formant peaks; e.g., F1 at 300 Hz and F2 at 2300 Hz. At
the largest scale is the overall spectral balance.





Figure 2.2. Spectrum in Figure 2.1 smoothed to suppress the excitation. (a)
Log spectrum convolved with gaussian (cepstral smoothing). (b) Power spectrum
convolved with gaussian (and then transformed to a log scale).



A key point in separating these factors in the speech signal is that they operate
at somewhat different scales in its spectrum; the fine scale structure is due mostly
to the excitation, while the intermediate scale structure is due to the vocal tract
transfer function. A common technique for selecting a scale of interest is to smooth
the spectrum by linear convolution, or equivalently, to window the Fourier transform
of the spectrum. The Fourier transform of the log spectrum is called the cepstrum, its
dimension quefrency, and the smoothing performed cepstral smoothing or liftering
[Oppenheim 1960; Oppenheim & Schafer 1975]. Figure 2.2a shows the spectrum in
Figure 2.1 after it has been cepstrally smoothed at a scale to emphasize the formants
and suppress the excitation. We shall see in Chapter 3 that this operation, in fact,
effectively separates excitation from transfer function in certain idealized, stationary
cases.
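The equivalence between convolving the log spectrum with a smooth kernel and windowing its transform can be illustrated with a toy example. This is a minimal sketch under our own assumptions: the function name `cepstral_smooth`, the gaussian lifter width `q0`, and the synthetic "envelope plus ripple" log spectrum are all hypothetical stand-ins, not the procedure or data used for Figure 2.2a.

```python
import numpy as np

def cepstral_smooth(log_spectrum, q0):
    """Smooth a log spectrum sampled on a full DFT grid by weighting its
    cepstrum with a gaussian lifter of standard deviation q0 (in samples)."""
    n = len(log_spectrum)
    cepstrum = np.fft.ifft(log_spectrum)
    quefrency = np.fft.fftfreq(n) * n            # signed index: 0, 1, ..., -1
    lifter = np.exp(-quefrency**2 / (2 * q0**2))
    return np.fft.fft(cepstrum * lifter).real

# Toy log spectrum: a slow "formant" envelope plus a fast "harmonic" ripple.
n = 512
k = np.arange(n)
envelope = np.cos(2 * np.pi * 2 * k / n)         # low quefrency (2)
ripple = 0.5 * np.cos(2 * np.pi * 50 * k / n)    # high quefrency (50)
smoothed = cepstral_smooth(envelope + ripple, q0=8.0)
```

With these values the ripple is almost entirely removed while the envelope is only slightly attenuated, which is the scale selection the text describes.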

It is smoothing the power spectrum, not its logarithm, that most easily generalizes
to the non-stationary case later. We will therefore select our scales of interest by
smoothing the power spectrum instead, or equivalently, by windowing its Fourier
transform, the autocorrelation function. Figure 2.2b shows the spectrum in Figure
2.1 after it has been thus smoothed.†

† Empirically, power and log smoothing often produce similar results.

What should the form of the convolution kernel in this smoothing operation be?

A desirable smoothing kernel would have good locality (or resolution) for a given
amount of smoothing. In other words, it would have small duration for the given
duration of its transform. These two durations are related by the uncertainty
principle: given a function $h(x)$ with Fourier transform $H(s)$, if the variance of
$|h(x)|^2$ is $(\Delta x)^2$ and the variance of $|H(s)|^2$ is $(\Delta s)^2$, then
$\Delta x\,\Delta s \ge \tfrac{1}{2}$ [Bracewell 1978]. Marr & Hildreth [1980]
proposed a gaussian smoothing kernel (in a vision task) because
it is the unique shape that meets the uncertainty principle with equality.
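The claim that the gaussian meets the uncertainty bound with equality can be checked numerically. The sketch below assumes frequency is measured in radians per unit of x, so that the bound is 1/2; the helper `duration_bandwidth_product` and the two-sided exponential used for comparison are our own illustrative choices.

```python
import numpy as np

def duration_bandwidth_product(h, dx):
    """Product of the std. devs. of |h(x)|^2 and |H(w)|^2, w in radians per unit x."""
    n = len(h)
    x = (np.arange(n) - n // 2) * dx
    w = 2 * np.pi * np.fft.fftfreq(n, d=dx)
    H = np.fft.fft(h)                     # only |H| is used, so phase is irrelevant

    def std(axis, density):
        p = density / density.sum()       # normalize to a unit-mass distribution
        mean = np.sum(axis * p)
        return np.sqrt(np.sum((axis - mean) ** 2 * p))

    return std(x, np.abs(h) ** 2) * std(w, np.abs(H) ** 2)

dx, n = 0.01, 2 ** 14
x = (np.arange(n) - n // 2) * dx
gaussian_product = duration_bandwidth_product(np.exp(-x**2 / 2), dx)
exponential_product = duration_bandwidth_product(np.exp(-np.abs(x)), dx)
```

On this grid the gaussian product comes out at the bound, while the two-sided exponential gives a noticeably larger product.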

2.2. The quasi-stationary case

The previous section examined the analysis of stationary speech signals. No real
speech signal, of course, is purely stationary. If the frequency content of a signal
varies slowly with time, however, there is a simple extension of the previous results.
The idea is to examine the signal over a short duration window. Given a signal $x(t)$
and a window $g(t)$, the short-time power spectrum at time $t$ is



$$S_x(t,\omega) = \left| \int_{-\infty}^{\infty} g(\tau)\, x(t+\tau)\, e^{-i\omega\tau}\, d\tau \right|^2 \tag{2.2.1}$$



Considered as a two-dimensional function of time and frequency, this signal
representation is called a spectrogram. Many different window shapes have been used;
they typically are symmetric, unimodal, and smooth, e.g., a gaussian or a raised
single period of a cosine.
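A discrete version of the short-time power spectrum (2.2.1) can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used for the figures: the function name `spectrogram`, the truncation of the gaussian window at four standard deviations, the hop size, and the 1000 Hz test tone are all hypothetical choices.

```python
import numpy as np

def spectrogram(x, fs, sigma, hop, n_freq=512):
    """Short-time power spectrum |sum_tau g(tau) x(t+tau) e^{-i w tau}|^2 with a
    gaussian window g of standard deviation sigma (seconds)."""
    half = int(np.ceil(4 * sigma * fs))           # truncate window at +/- 4 sigma
    tau = np.arange(-half, half + 1) / fs
    g = np.exp(-tau**2 / (2 * sigma**2))
    n_freq = max(n_freq, len(g))                  # avoid truncating the frame in the FFT
    centers = np.arange(half, len(x) - half, hop) # frame centers, in samples
    frames = np.stack([x[c - half:c + half + 1] * g for c in centers])
    S = np.abs(np.fft.rfft(frames, n=n_freq, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n_freq, d=1 / fs)
    return centers / fs, freqs, S

# Usage on a stationary test tone: every short-time spectrum peaks at the tone.
fs = 8000
t = np.arange(fs) / fs
tone = np.cos(2 * np.pi * 1000 * t)
times, freqs, S = spectrogram(tone, fs, sigma=0.004, hop=80)
peak_freqs = freqs[S.argmax(axis=1)]
```

Each row of `S` is one short-time spectrum; stacking the rows over time gives the two-dimensional display.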

Signals for which a window can be found whose duration is long enough to allow
adequate frequency resolution, but short enough to allow adequate time resolution,
are called quasi-stationary. The example of the previous section was, in fact, a
quasi-stationary vowel. Virtually all speech analysis methods in the past depend
on the quasi-stationary assumption.

2.3. Non-stationarity

There do exist signals for which no window duration is adequate. A very simple
such signal is the linear chirp, $e^{i m t^2/2}$, whose instantaneous frequency
increases linearly with time. The quasi-stationary assumption breaks down for
sufficiently large modulation slope $m$ of the signal. Let us examine this claim.




By the uncertainty principle, the product of the time duration $\Delta t$ and the
frequency duration (bandwidth) $\Delta\omega$ of a window is bounded below by 1/2.
The window duration and bandwidth, in turn, determine the time and frequency
resolution, respectively, in the short-time spectra.† In other words, if the window
duration is too small, then the frequency resolution will be poor, and if the window
duration is too long, the time resolution will be poor. Further, for a non-stationary
signal, poor time resolution can also mean poor frequency resolution, since the
frequency content will have changed over the duration of the window, blurring the
spectrum.

To illustrate these points, consider the short-time spectrum of a linear chirp,
$e^{i m t^2/2}$, using a gaussian window, $e^{-t^2/2\sigma^2}$. We can measure the
relative bandwidth of the spectrum for different window sizes ($\sigma$) in terms of
the standard deviation of the spectrum ($\approx .42$ of the half-power bandwidth),
which is $\sqrt{(m^2\sigma^4 + 1)/2\sigma^2}$, where the units are seconds and radians.
Note that when $m \ne 0$, this grows without bound as the window size becomes very
small or very large. It has a minimum value of $\sqrt{m}$, which occurs when the
standard deviation of the gaussian is $1/\sqrt{m}$.
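The bandwidth expression can be checked with a short derivation. This is a sketch assuming $m > 0$, the window $e^{-t^2/2\sigma^2}$, and evaluation at $t = 0$ (at other times the chirp only shifts the spectral peak to $\omega = mt$); the auxiliary symbols $a$ and $s$ are introduced here for convenience.

```latex
\begin{aligned}
\int_{-\infty}^{\infty} e^{-\tau^2/2\sigma^2}\, e^{i m \tau^2/2}\, e^{-i\omega\tau}\, d\tau
  &= \sqrt{\pi/a}\; e^{-\omega^2/4a},
  \qquad a = \frac{1}{2\sigma^2} - \frac{i m}{2}, \\
\left| \sqrt{\pi/a}\; e^{-\omega^2/4a} \right|^2
  &\propto \exp\!\left( -\omega^2\, \mathrm{Re}\,\frac{1}{2a} \right)
   = \exp\!\left( -\frac{\omega^2 \sigma^2}{1 + m^2\sigma^4} \right)
   = \exp\!\left( -\frac{\omega^2}{2 s^2} \right), \\
s^2 &= \frac{m^2\sigma^4 + 1}{2\sigma^2}
     = \frac{m^2\sigma^2}{2} + \frac{1}{2\sigma^2}, \\
\frac{d s^2}{d(\sigma^2)} = \frac{m^2}{2} - \frac{1}{2\sigma^4} = 0
  &\;\Longrightarrow\; \sigma^2 = \frac{1}{m},
  \qquad s_{\min} = \sqrt{\frac{m}{2} + \frac{m}{2}} = \sqrt{m}.
\end{aligned}
```

The first term of $s^2$ is the blurring caused by the frequency sweep within the window and the second is the window's own bandwidth; the minimum occurs where the two are equal.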

We see from this that the minimum possible bandwidth of the short-time spectrum
of a chirp (using a gaussian window) grows with increasing modulation slope.
Figure 2.3 shows the short-time spectra of chirps of various modulation slopes using
windows that give the minimum bandwidth. For a slope of 50 Hz/msec, the chirp
peak has been broadened by several hundred Hz in the spectrum. The point here
is that, in theory, the usual quasi-stationary spectral analysis methods will give
poor resolution for sufficiently non-stationary signals. A few examples from natural
speech will show that such conditions arise in practice.
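To get a feel for the magnitudes involved, the bandwidth formula can be evaluated for a few modulation slopes. The sketch below is illustrative only: the slopes of 10, 25 and 50 Hz/msec are our own sample values, the helper name `spectrum_std` is hypothetical, and the conversion to a half-power width assumes the gaussian spectral shape derived from the gaussian window.

```python
import numpy as np

def spectrum_std(m, sigma):
    """Std. dev. (rad/s) of the short-time power spectrum of exp(i*m*t^2/2)
    under the gaussian window exp(-t^2 / (2*sigma^2)); m in rad/s^2, sigma in s."""
    return np.sqrt((m**2 * sigma**4 + 1) / (2 * sigma**2))

# For a gaussian-shaped spectrum, half-power width = 2*sqrt(2 ln 2) * std (std ~ .42 width).
HALF_POWER_PER_STD = 2 * np.sqrt(2 * np.log(2))

results = {}
for slope_hz_per_ms in (10, 25, 50):
    m = 2 * np.pi * slope_hz_per_ms * 1e3        # Hz/msec -> rad/s^2
    sigma_opt = 1 / np.sqrt(m)                   # window giving minimum bandwidth
    std_hz = spectrum_std(m, sigma_opt) / (2 * np.pi)
    results[slope_hz_per_ms] = (sigma_opt * 1e3, std_hz * HALF_POWER_PER_STD)

for slope, (sigma_ms, width_hz) in results.items():
    print(f"{slope} Hz/msec: window std {sigma_ms:.2f} msec, "
          f"half-power width {width_hz:.0f} Hz")
```

The optimal window shrinks and the minimum achievable bandwidth grows as the slope increases, both in proportion to the square root of the slope.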



† This is made precise by a theorem in Section 2.6.



[Figure 2.3, surviving panel labels: (a) window std. dev. infinite; (b) window std.
dev. 4.0 msec; (c) slope 25 Hz/msec, window std. dev. 2.5 msec; (d) window std. dev.
1.8 msec.]

Figure 2.3. Short-time spectra of linear chirps of several modulation slopes using
gaussian windows that give the minimum bandwidth. At the largest slope, the chirp
peak is significantly broadened.




Figure 2.4 shows cepstrally smoothed, short-time spectra of various /w/'s, uttered
first slowly and then increasingly rapidly. The spectrogram window used was a
gaussian of 4 msec standard deviation, which has an effective duration of about a
pitch period, the minimum duration that gives a reasonably stable spectral
estimate. The cepstral window is also chosen as brief as possible, while still removing
the harmonic peaks. Notice that the peak in the spectrum at about 1800 Hz,
corresponding to F2, grows in bandwidth with the increasing slope of F2 as seen in the
corresponding spectrograms in Figure 2.5. In case (c), where the F2 slope is about
40 Hz/msec, F2 is so broadened that its peak (i.e., the local maximum) is lost in
the short-time spectrum. Such an F2 slope is not uncommon for a /w/. In /j/'s, F2
can have large negative slopes, and in /r/ contexts, F3 can have very steep slopes;
see Figure 2.6. At consonant-vowel transitions, where the formant trajectories are
considered very important for stop consonant identification [Liberman, et al 1954],
the formant motion can also be very rapid; again see Figure 2.6.

It is worth noting that natural sounds other than the human voice can produce
non-stationary signals that are "chirped." For instance, bird song and bat cries
contain many rapid FM chirps [Greenewalt 1968; Marler 1979; Neuweiler 197?]. If
a sound source is in relative motion to the listener, then Doppler effects can cause
large frequency shifts in the received signal across time [e.g., Dudgeon 19??].†
Glissandi of various musical instruments provide still more examples of signals that
contain rapidly time-varying spectral content.

It is also suggestive that neurophysiologists have found that a large population of
the auditory cells in the mammalian cochlear nucleus do not respond optimally to
continuous tones, but instead to sweep tones, with different populations responding
to different preferred modulation slopes ranging over ±15 Hz/msec [Møller 1978;
Britt & Starr 1976]. Further, psychophysical adaptation studies have shown similar
directional selectivity in the human auditory system [Kay & Matthews 1972; Regan
& Tansley 1979].

† Some bats (the so-called CF bats) emit continuous tones, evidently depending on
Doppler shifts for [...]

Figure 2.4. Cepstrally smoothed, short-time spectra of /w/'s, uttered first very
slowly, then increasingly rapidly. In (c), F2 is so broadened by the analysis that its
peak (i.e., the local maximum) disappears. Cf. Figure 2.5.

Figure 2.5. Wide-band spectrograms of the /w/'s used in Figure 2.4. Note that
F2 remains clearly visible with increasing slope in the two-dimensional display.

The above comments are meant to call into question the validity of the
quasi-stationary assumption for speech and other auditory signals. We have seen that
speech is not always quasi-stationary, even in the sonorant regions. Assuming so
means that important features will be missed, having been blurred by the
analysis. It is interesting to note that while the individual short-time spectra of the
non-stationary signals described above give a poor description of the signals, their
spectrograms are nevertheless quite legible. This is because when we look at a
spectrogram, we are not confined to examining it one-dimensionally along single
frequency slices, but instead we see a two-dimensional time and frequency surface.
In other words, time is not used as a parameter that varies over a family of spectra,
but as one of the intrinsic dimensions of the representation.

Figure 2.6. Spectrograms of rapid formant motion in various contexts. (a) /ju/.
(b) /ar»/. (c) /bi/ in the context /tabi/. (d) /du/ in the context /lidu/.

I believe, in fact, that thinking of the initial speech processing as consisting of a family of independent one-dimensional spectral analyses parameterized by time is inappropriate. The problem should be thought of as a joint time-frequency analysis, with the relationships and trade-offs between the two dimensions directly addressed, which brings us to the next section.

2.4. Joint time-frequency representations

Various ways have been used to express signal energy as a joint function of time and frequency. Certainly the most popular is the spectrogram,



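$$S_x(t,\omega) = \left|\int_{-\infty}^{\infty} g(\tau - t)\, x(\tau)\, e^{-i\omega\tau}\, d\tau\right|^2, \qquad (2.4.1)$$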



which is just the short-time spectra described above displayed two-dimensionally. The fact that the simultaneous time and frequency resolution in the spectrogram is bounded by the uncertainty relation has led others to seek representations that do not have this limitation.
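As a rough discrete-time illustration of the spectrogram, the sketch below evaluates the windowed transform by direct summation. The function names, the Gaussian window, the frequency grid, and the convention of sliding the window as g[n - m] are assumptions made for this example; they are not taken from the thesis.

```python
import cmath
import math


def gaussian_window(half_width, sigma):
    """Samples of exp(-n^2 / (2 sigma^2)) for n = -half_width .. half_width."""
    return [math.exp(-(n * n) / (2.0 * sigma * sigma))
            for n in range(-half_width, half_width + 1)]


def spectrogram(x, window, n_freq):
    """S[m][k] = |sum_n x[n] g[n - m] exp(-i w_k n)|^2 with w_k = 2*pi*k/n_freq.

    Samples of x outside 0 .. len(x)-1 are treated as zero (an assumption).
    """
    half = len(window) // 2
    S = []
    for m in range(len(x)):
        row = []
        for k in range(n_freq):
            w = 2.0 * math.pi * k / n_freq
            acc = 0j
            for j, g in enumerate(window):
                n = m + j - half          # window sample j sits at time n
                if 0 <= n < len(x):
                    acc += x[n] * g * cmath.exp(-1j * w * n)
            row.append(abs(acc) ** 2)
        S.append(row)
    return S


# A complex exponential at frequency bin 8 of 32 (illustrative values).
x = [cmath.exp(2j * math.pi * 8 * n / 32) for n in range(64)]
S = spectrogram(x, gaussian_window(8, 3.0), 32)
peak_bin = max(range(32), key=lambda k: S[32][k])
```

The energy peaks at the bin of the exponential but also leaks into neighbouring bins; shortening the window widens that leakage, which is the resolution trade-off referred to above.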

This is usually formulated in terms of the marginals (or projections) of the signal representation $F_x(t,\omega)$ [Cohen 1966]. Let

$$\pi_1(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F_x(t,\omega)\, d\omega, \qquad (2.4.2a)$$

$$\pi_2(\omega) = \int_{-\infty}^{\infty} F_x(t,\omega)\, dt. \qquad (2.4.2b)$$

Perfect time and frequency resolution in this formulation requires that

$$\pi_1(t) = |x(t)|^2 \quad\text{and}\quad \pi_2(\omega) = |X(\omega)|^2. \qquad (2.4.3)$$

An example of a joint time-frequency representation that satisfies these requirements is the Wigner distribution,

$$W_x(t,\omega) = \int_{-\infty}^{\infty} e^{-i\omega\tau}\, x(t+\tau/2)\, x^*(t-\tau/2)\, d\tau, \qquad (2.4.4)$$

which is currently quite popular in the signal processing literature [Claasen & Mecklenbräuker 1980a,c].

The Wigner distribution of an impulse, $x(t) = \delta(t - t_0)$, is $W_x(t,\omega) = \delta(t - t_0)$, i.e., the signal energy is taken to lie on the vertical line $t = t_0$ in the time-frequency plane. Similarly, for a complex exponential $x(t) = e^{i\omega_0 t}$, the signal energy lies on the horizontal line at $\omega = \omega_0$ ($W_x(t,\omega) = 2\pi\delta(\omega - \omega_0)$), and for a linear chirp, $x(t) = e^{i(\omega_0 t + \frac12 m t^2)}$, the energy lies on the slanted line $\omega = mt + \omega_0$ ($W_x(t,\omega) = 2\pi\delta(\omega - \omega_0 - mt)$) (see Figure 2.7a).
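The chirp case can be illustrated numerically. The sketch below uses one common discrete-time form of the Wigner distribution with a finite lag range; the discretisation, the lag limit, and the chirp parameters are assumptions chosen for the example. Note that the frequency axis of this discrete form has period pi rather than 2*pi.

```python
import cmath
import math


def wigner_slice(x, n, max_lag, n_freq):
    """Discrete Wigner distribution at time index n.

    W[k] = 2 * sum_{m=-L..L} x[n+m] * conj(x[n-m]) * exp(-2i * theta_k * m),
    with theta_k = pi * k / n_freq (the frequency axis has period pi).
    """
    out = []
    for k in range(n_freq):
        theta = math.pi * k / n_freq
        acc = 0j
        for m in range(-max_lag, max_lag + 1):
            acc += x[n + m] * x[n - m].conjugate() * cmath.exp(-2j * theta * m)
        out.append(2.0 * acc.real)   # the sum is real by conjugate symmetry
    return out


# Linear chirp x[n] = exp(i (w0 n + mu n^2 / 2)); values are illustrative.
N_FREQ = 64
w0 = math.pi * 8 / N_FREQ        # starting frequency (bin 8)
mu = math.pi / N_FREQ            # chirp rate: one frequency bin per sample
x = [cmath.exp(1j * (w0 * n + 0.5 * mu * n * n)) for n in range(64)]

ridge = []
for n in (20, 25, 30):
    W = wigner_slice(x, n, 16, N_FREQ)
    ridge.append(max(range(N_FREQ), key=lambda k: W[k]))
```

At each time index the maximum sits at the bin of the instantaneous frequency w0 + mu*n, so the ridge follows the slanted line described in the text.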

In contrast, the spectrograms of these signals consist of broadened lines (see Figure 2.7b). There is, in fact, a simple relation between the spectrogram and the Wigner distribution of a signal $x(t)$:

$$S_x(t,\omega) = \frac{1}{2\pi}\, W_g(t,\omega) ** W_x(t,\omega), \qquad (2.4.5)$$



Figure 2.7. Wigner distribution and spectrogram of some mono-component signals. (a) The Wigner distribution resolves these signals as perfectly narrow lines in the time-frequency plane. (b) The spectrogram is a smoothed version of the Wigner distribution (e.g., if the spectrogram window is a gaussian, then the smoothing kernel is a 2-D gaussian). The lines are broadened in this representation.


where ** denotes two-dimensional convolution and $W_g$ is the Wigner distribution of the window [Claasen & Mecklenbräuker 1980c]. If $g(t)$ is a gaussian, $g(t) = e^{-t^2/2\sigma^2}$, then its Wigner distribution is also simple; it is just a two-dimensional gaussian, $W_g(t,\omega) = 2\sigma\sqrt{\pi}\, e^{-t^2/\sigma^2}\, e^{-\sigma^2\omega^2}$. Thus, the two-dimensional convolution of the Wigner distributions in Figure 2.7a by a two-dimensional gaussian will give the spectrograms in Figure 2.7b.

If the duration of the gaussian spectrogram window is decreased, then the 2-D gaussian that, in essence, convolves the Wigner distribution to give the spectrogram becomes narrower in time, but wider in frequency, and vice versa. It should be clear from this example that the spectrogram does not meet the marginal requirement.

On the other hand, the Wigner distribution itself has some undesirable properties. In particular, multi-component signals give rise to cross terms that cannot be attributed much physical significance. For example, the Wigner distribution of $x(t) = \cos\omega_0 t$ is $W_x(t,\omega) = \frac{\pi}{2}[\delta(\omega - \omega_0) + \delta(\omega + \omega_0) + 2\delta(\omega)\cos 2\omega_0 t]$ (see Figure 2.8a). The last term, which lies on a horizontal line at the frequency origin (varying sinusoidally in amplitude), seems spurious. The spectrogram of $\cos\omega_0 t$, however, is just two broadened lines at $\omega = \pm\omega_0$, which seems better behaved with respect to superposition, since $\cos\omega_0 t = \frac12(e^{i\omega_0 t} + e^{-i\omega_0 t})$ (see Figure 2.8b). The cross term is, in effect, smoothed out by the convolution that transforms the Wigner distribution into the spectrogram.
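The cross term for the cosine can be checked directly from the definition of the Wigner distribution (Eq. 2.4.4). The short derivation below is a sketch added for illustration; it assumes only the product-to-sum identity for cosines and the transform pair $\int e^{-i\omega\tau}\,d\tau = 2\pi\delta(\omega)$.

```latex
\begin{aligned}
x(t+\tfrac{\tau}{2})\,x^*(t-\tfrac{\tau}{2})
  &= \cos\omega_0(t+\tfrac{\tau}{2})\,\cos\omega_0(t-\tfrac{\tau}{2})
   = \tfrac12\left[\cos 2\omega_0 t + \cos\omega_0\tau\right], \\
W_x(t,\omega)
  &= \tfrac12\int_{-\infty}^{\infty} e^{-i\omega\tau}
     \left[\cos 2\omega_0 t + \cos\omega_0\tau\right] d\tau \\
  &= \tfrac{\pi}{2}\left[\delta(\omega-\omega_0)+\delta(\omega+\omega_0)
     + 2\,\delta(\omega)\cos 2\omega_0 t\right].
\end{aligned}
```

The first two impulses are the auto terms of the two complex exponentials; the third arises from their product and oscillates in sign at twice the signal frequency, so it cannot represent energy.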

These examples illustrate that there are various (possibly conflicting) properties that we might desire of a time-frequency representation, e.g., good time and frequency resolution, and superposition for multi-component signals. We shall, in fact, approach the problem of choosing our time-frequency energy representation by first specifying a set of desirable properties that the transform should satisfy, and then deriving its form.

Figure 2.8. Wigner distribution and spectrogram for $\cos\omega_0 t$. (a) The Wigner distribution of this signal has the 'spurious' cross term $2\delta(\omega)\cos 2\omega_0 t$ at the origin. (b) The spectrogram does not show this term; it has been, in effect, smoothed out.

2.5. Design criteria for joint time-frequency representations 

We will restrict the discussion to the quadratic transforms of the signal, which have the form

$$F_x(t,\omega) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} k(\tau_1,\tau_2;t,\omega)\, x(\tau_1)\, x^*(\tau_2)\, d\tau_1\, d\tau_2, \qquad (2.5.1)$$

where $k(\tau_1,\tau_2;t,\omega)$ is an arbitrary function. This condition is imposed because it results in a particularly manageable class, and because the representation of energy as a quadratic function of the signal seems reasonable by analogy to other definitions of energy. The class is quite large and includes many of the joint time-frequency representations that have been previously proposed, such as the spectrogram, the Wigner distribution, and the Rihaczek distribution [cf. Claasen & Mecklenbräuker].

From this class of representations, we seek ones that satisfy the following criteria: 

(C1) Shift invariance: A shift in time or frequency of the signal should result in a corresponding shift in time or frequency in the transform. Let $y(t) = x(t-\tau)$ and $z(t) = e^{i\nu t}x(t)$. Then we require $F_y(t,\omega) = F_x(t-\tau,\omega)$ and $F_z(t,\omega) = F_x(t,\omega-\nu)$. This property is desirable if we want to interpret the two dimensions of the transform as time and frequency.
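A quick numerical check of this property for a discrete-time spectrogram is sketched below. The signal, the Gaussian window, the grid sizes, and the helper name `stft_energy` are all hypothetical choices made for the illustration; the shifts are restricted to whole samples and whole frequency bins.

```python
import cmath
import math


def stft_energy(x, m, k, n_freq, half=6, sigma=2.0):
    """|sum_n x(n) g(n - m) exp(-2*pi*i*k*n/n_freq)|^2 with a Gaussian g.

    x is a function of the integer sample index, so it can be shifted freely.
    """
    acc = 0j
    for n in range(m - half, m + half + 1):
        g = math.exp(-((n - m) ** 2) / (2.0 * sigma ** 2))
        acc += x(n) * g * cmath.exp(-2j * math.pi * k * n / n_freq)
    return abs(acc) ** 2


def x(n):
    """Hypothetical test signal: a short two-tone burst, zero elsewhere."""
    if 0 <= n < 24:
        return cmath.exp(0.7j * n) + 0.5 * cmath.exp(2.1j * n)
    return 0j


N_FREQ, d, p = 16, 5, 3          # grid size, time shift, frequency shift (bins)


def y(n):                        # time-shifted copy: y(n) = x(n - d)
    return x(n - d)


def z(n):                        # modulated copy: z(n) = exp(i*nu*n) x(n)
    return cmath.exp(2j * math.pi * p * n / N_FREQ) * x(n)


time_err = max(abs(stft_energy(y, m, k, N_FREQ)
                   - stft_energy(x, m - d, k, N_FREQ))
               for m in range(30) for k in range(N_FREQ))
freq_err = max(abs(stft_energy(z, m, k, N_FREQ)
                   - stft_energy(x, m, (k - p) % N_FREQ, N_FREQ))
               for m in range(30) for k in range(N_FREQ))
```

Both errors are at the level of floating-point round-off, as expected: delaying the signal only delays the spectrogram, and modulating it only moves the spectrogram along the frequency axis.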

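Transforms satisfying this condition can be put in the forms

$$F_x(t,\omega) = \frac{1}{2\pi}\,\phi(t,\omega) ** W_x(t,\omega) \qquad (2.5.2)$$

and

$$F_x(t,\omega) = \mathcal{F}^{-1}[\Phi(\tau,\nu)\,A_x(\tau,\nu)], \qquad (2.5.3)$$

where ** denotes two-dimensional convolution, $W_x$ is the Wigner distribution

$$W_x(t,\omega) = \int_{-\infty}^{\infty} e^{-i\omega\tau}\, x(t+\tau/2)\, x^*(t-\tau/2)\, d\tau, \qquad (2.5.4)$$

$\phi(t,\omega)$ is an arbitrary kernel function, $\mathcal{F}$ is the 2-D Fourier transform in the form

$$\mathcal{F}[\phi(t,\omega)] = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \phi(t,\omega)\, e^{-i(\nu t - \tau\omega)}\, dt\, d\omega,$$

$\Phi(\tau,\nu) = \mathcal{F}[\phi(t,\omega)]$, and $A_x$ is the time-frequency autocorrelation function†

$$A_x(\tau,\nu) = \mathcal{F}[W_x(t,\omega)] = \int_{-\infty}^{\infty} e^{-i\nu t}\, x(t+\tau/2)\, x^*(t-\tau/2)\, dt \qquad (2.5.5)$$

for $x(t)$ [Claasen & Mecklenbräuker 1980c]. Note that for a spectrogram, $\phi(t,\omega)$ is the Wigner distribution of the spectrogram window $g$, by Eq. 2.4.5 and Eq. 2.5.2.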

(C2) Positivity: The signal energy at a given point in time and frequency should be real and positive: $F_x(t,\omega) \ge 0$ for all $x$, $t$, and $\omega$. This seems appropriate for interpreting the transform as an energy distribution. Some authors have argued against the positivity requirement [e.g., Claasen & Mecklenbräuker 1980c]. We shall examine the consequences of lifting this condition in the next section.

(C3) Superposition: The idea is that the time-frequency representation of a multi-component signal should be a simple superposition of its components. The straightforward linear formulation of this, i.e., $F_{x+cy}(t,\omega) = F_x(t,\omega) + cF_y(t,\omega)$, however, is inconsistent with the quadratic nature of the transform and the shift-invariance property C1. This apparent shortcoming is also true, for example, of the spectrogram (Eq. 2.4.1). Nevertheless we usually think of the conventional spectrogram as being well-behaved under superposition. This is because non-overlapping components do superimpose, i.e., $S_{x+y}(t,\omega) = S_x(t,\omega) + S_y(t,\omega)$ when $S_x(t,\omega)S_y(t,\omega) = 0$. There are no cross terms in this case. On the other hand, the Wigner distribution does not have this property, suffering from cross terms to which there cannot be attributed much physical significance.

† Some authors call the time-frequency autocorrelation function $A_x$ the ambiguity function [e.g., Claasen & Mecklenbräuker 1980a]; others reserve this term for $|A_x(\tau,\nu)|^2$ [e.g., Van Trees 1968].

We shall require this property for our time-frequency representation, namely

$$F_{x+y}(t,\omega) = F_x(t,\omega) + F_y(t,\omega) \quad\text{when}\quad F_x(t,\omega)F_y(t,\omega) = 0. \qquad (2.5.6a)$$

More generally, we would like $F_{x+y}(t,\omega) \approx F_x(t,\omega) + F_y(t,\omega)$ when $F_x(t,\omega)F_y(t,\omega) \approx 0$. Stated more precisely, we require that for any $\epsilon > 0$, there exists a $\delta > 0$ such that

$$|F_{x+y}(t,\omega) - [F_x(t,\omega) + F_y(t,\omega)]| < \epsilon \quad\text{when}\quad |F_x(t,\omega)F_y(t,\omega)| < \delta. \qquad (2.5.6b)$$



(C4) Locality: Signal energy that is localized in time-frequency should be localized in time-frequency in the transform. The advantage of the Wigner distribution is that it is perfectly localized according to various criteria, such as preserving the marginal distributions (Eq. 2.4.3) and the finite support properties [see Claasen & Mecklenbräuker 1980a].† The Wigner distribution, however, does not satisfy the positivity (C2) or superposition (C3) properties, as indicated earlier. In fact, positivity (and thus, as we shall see, superposition) is inconsistent with the time and frequency marginal conditions [Claasen & Mecklenbräuker 1980c]. Fortunately, for our purposes, we do not require perfect locality, so we can relax the above conditions somewhat.

† The finite support property states that if a signal has finite extent in time or frequency then its representation will have the same extent in the corresponding variable.



From Eq. 2.5.2, the transform kernel $\phi(t,\omega)$ can be viewed as the point spread function on the perfectly localized Wigner distribution.† We can therefore measure the locality of the transform in time and frequency in terms of the variances

$$\sigma_t^2 = \frac{\iint t^2\,|\phi(t,\omega)|^2\, dt\, d\omega}{\iint |\phi(t,\omega)|^2\, dt\, d\omega} \quad\text{and}\quad \sigma_\omega^2 = \frac{\iint \omega^2\,|\phi(t,\omega)|^2\, dt\, d\omega}{\iint |\phi(t,\omega)|^2\, dt\, d\omega},$$

where we assume that the center of mass of $|\phi(t,\omega)|^2$ is at the origin.‡



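In general, these two measures are not enough; an additional locality measure is important, the covariance

$$\sigma_{t\omega} = \frac{\iint t\,\omega\,|\phi(t,\omega)|^2\, dt\, d\omega}{\iint |\phi(t,\omega)|^2\, dt\, d\omega}.$$

Together, $\sigma_t$, $\sigma_\omega$, and $\sigma_{t\omega}$ determine the covariance matrix and the associated concentration ellipse in the $(t,\omega)$ plane,

$$\begin{pmatrix} t & \omega \end{pmatrix} \begin{pmatrix} \sigma_t^2 & \sigma_{t\omega} \\ \sigma_{t\omega} & \sigma_\omega^2 \end{pmatrix}^{-1} \begin{pmatrix} t \\ \omega \end{pmatrix} = 1.$$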

When $\sigma_{t\omega} = 0$, the major and minor axes of the concentration ellipse coincide with the time and frequency axes (Figure 2.9a). More generally, the concentration ellipse may be oriented obliquely relative to the coordinate axes (Figure 2.9b). We shall call transforms that satisfy the condition $\sigma_{t\omega} = 0$ on their kernel non-directionally localized. This name is appropriate since we can rescale the co-ordinate axes to make the concentration ellipse a circle under this condition. Thus viewed, the transform spreads energy uniformly in all directions in time-frequency. On the other hand, if $\sigma_{t\omega} \ne 0$, then this does not hold, and the transform will be directionally localized, always having better resolution in some time-frequency directions than others regardless of the scaling of the axes.

Figure 2.9. Concentration ellipses for transform kernels. (a) Non-directional kernel ($\sigma_{t\omega} = 0$); the co-ordinate axes can be re-scaled to make the concentration ellipse a circle. Thus viewed, the corresponding transform spreads the signal energy evenly in all time-frequency directions. (b) Directional kernel ($\sigma_{t\omega} \ne 0$); the co-ordinate axes cannot be rescaled to make the concentration ellipse a circle. The corresponding transform always has better resolution in some time-frequency directions than others.

† The generality of this approach depends on the Wigner distribution uniquely satisfying 'perfect' locality. Cohen has shown that a quadratic transform that satisfies the shift-invariance property (C1) will meet both time and frequency marginal conditions (Eq. 2.4.3) if $\Phi(\tau,0) = 1$ for all $\tau$ and $\Phi(0,\nu) = 1$ for all $\nu$. These marginal conditions essentially guarantee that an impulse and a complex exponential are not 'blurred' by the time-frequency representation, but are not strong enough to also guarantee that a linear chirp is not 'blurred' (see Figure 2.7b). This additional condition is met uniquely by the Wigner distribution. In other words, we interpret perfect locality to mean that the signal transform does not spread the signal energy in any direction in time-frequency (not just the horizontal and vertical directions). We postpone a more thorough discussion of this point until Section 2.8, when the necessary mathematical machinery will be introduced.

‡ This assumption is not very restrictive on the form of the transform, since we can always shift $\phi(t,\omega)$ in time and frequency to satisfy it. This shift, in turn, shifts the transform in time and frequency.
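The locality measures can be estimated numerically for any sampled kernel. The sketch below approximates the normalised second moments of $|\phi|^2$ by a Riemann sum and reports the orientation of the concentration ellipse. The two Gaussian kernels, the grid, and the function names are assumptions made for the illustration, and the orientation angle is only meaningful once time and frequency are expressed in comparable units.

```python
import math


def kernel_covariance(phi, half_extent=6.0, step=0.05):
    """Second moments of |phi(t, w)|^2 by a Riemann sum on a square grid.

    Returns (var_t, var_w, cov_tw), each normalised by the total mass of
    |phi|^2; the centre of mass is assumed to be at the origin.
    """
    n = int(round(half_extent / step))
    mass = s_tt = s_ww = s_tw = 0.0
    for i in range(-n, n + 1):
        t = i * step
        for j in range(-n, n + 1):
            w = j * step
            p = abs(phi(t, w)) ** 2
            mass += p
            s_tt += t * t * p
            s_ww += w * w * p
            s_tw += t * w * p
    return s_tt / mass, s_ww / mass, s_tw / mass


def ellipse_angle(var_t, var_w, cov_tw):
    """Orientation (radians) of the major axis of the concentration ellipse."""
    return 0.5 * math.atan2(2.0 * cov_tw, var_t - var_w)


# Two hypothetical Gaussian kernels, one without and one with a cross term.
plain = kernel_covariance(lambda t, w: math.exp(-(t * t + 4.0 * w * w)))
tilted = kernel_covariance(lambda t, w: math.exp(-(t * t - t * w + w * w)))
```

For the first kernel the covariance vanishes and the ellipse is aligned with the axes; for the second the cross term in the exponent produces a non-zero covariance and an obliquely oriented ellipse.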

The analysis of the non-directional transforms is more straightforward. We therefore restrict our attention to this case until Section 2.8, when we shall examine the more general case. We will see there that the principal results are essentially the same as in the non-directional case, suitably generalized. The analysis, however, is more complex, and is thus best left until later.

To summarize, given a non-directional transform ($\sigma_{t\omega} = 0$), $\sigma_t$ and $\sigma_\omega$ measure its degree of locality in time and frequency. The smaller $\sigma_t$ and $\sigma_\omega$ are, the better the time and frequency resolution.

(C5) Smoothness: Similar to the stationary case, different aspects of the speech signal can arise at different scales in time-frequency. For example, voiced excitation can give rise to fine scale structure on the order of the pitch period in the time dimension and the fundamental frequency in the frequency dimension. The formant structure, on the other hand, arises at a somewhat larger scale. Thus, one of the design parameters for our transform is the scale in time-frequency we wish to examine. Said differently, we want the transform to be smooth in time-frequency to a given degree.

This notion of scale can be formalized by measuring the distribution of the spatial frequencies present in $F_x(t,\omega)$, i.e., the distribution of energy about the origin of its 2-D Fourier transform. Since $\mathcal{F}[F_x(t,\omega)] = \Phi(\tau,\nu)A_x(\tau,\nu)$ (Eq. 2.5.3), the relative amount of spread is determined by the choice of $\Phi(\tau,\nu)$, which windows the time-frequency autocorrelation function. We can measure this spread in terms of the variances

$$\Sigma_\tau^2 = \frac{\iint \tau^2\,|\Phi(\tau,\nu)|^2\, d\tau\, d\nu}{\iint |\Phi(\tau,\nu)|^2\, d\tau\, d\nu}, \qquad \Sigma_\nu^2 = \frac{\iint \nu^2\,|\Phi(\tau,\nu)|^2\, d\tau\, d\nu}{\iint |\Phi(\tau,\nu)|^2\, d\tau\, d\nu}$$

and the covariance

$$\Sigma_{\tau\nu} = \frac{\iint \tau\,\nu\,|\Phi(\tau,\nu)|^2\, d\tau\, d\nu}{\iint |\Phi(\tau,\nu)|^2\, d\tau\, d\nu},$$

where we assume that the center of mass of $|\Phi(\tau,\nu)|^2$ is at the origin.† These determine the covariance matrix and the associated concentration ellipse in the $(\tau,\nu)$ plane.

When $\Sigma_{\tau\nu} = 0$, we call the transform non-directionally smooth. In this case, it is possible to rescale the co-ordinate axes to make the concentration ellipse a circle, and thus viewed the transform smoothes the signal uniformly in all directions in time-frequency. On the other hand, if $\Sigma_{\tau\nu} \ne 0$, then this does not hold, and the transform will be directionally smooth, always smoothing more in some time-frequency directions than others regardless of the scaling of the axes. Just as for the locality condition, we will restrict attention now to the non-directional transforms. We consider the more general case in Section 2.8.

To summarize, given a non-directional transform ($\Sigma_{\tau\nu} = 0$), $\Sigma_\tau$ and $\Sigma_\nu$ measure its scale in time and frequency. The smaller $\Sigma_\tau$ and $\Sigma_\nu$ are, the larger the selected scales.

Observe at this point the parallels between the stationary and non-stationary analyses. If we think of the Wigner distribution as the non-stationary analog to the raw power spectrum, then the time-frequency autocorrelation function (the Wigner distribution's 2-D Fourier transform) is the 2-D analog to the autocorrelation function (the power spectrum's Fourier transform). Further, windowing the time-frequency autocorrelation function smoothes the Wigner distribution, just as windowing the autocorrelation smoothes the raw spectrum. In both cases, the design decisions for the resulting transform require selecting a convolution kernel that satisfies both locality and smoothness requirements. In fact, we shall see in the next chapter that the analogy is even closer.

† This assumption will be true if the transform is real.

2.6. Relations among the design criteria

The various design criteria for our time-frequency energy representation are not independent. We shall state the important relationships among them in this section. Throughout this section we assume that the input signal $x(t)$ is finite energy (i.e., $x \in L^2$) and that $F_x(t,\omega)$ is a quadratic transform of the signal. This means that $F_x(t,\omega) = (T_{t,\omega}x, x)$, where $(x,y) = \int_{-\infty}^{\infty} x(\alpha)\,y^*(\alpha)\,d\alpha$ and $T_{t,\omega}$ is a (bounded) linear operator on $L^2$.

• Shift-invariance & Positivity: Together these imply that the transform can be expressed as a superposition of spectrograms.†

Theorem A. Let $F_x(t,\omega)$ be positive and shift-invariant. Then it has the form

$$F_x(t,\omega) = \int_{-\infty}^{\infty} S_x(t,\omega;g_\alpha)\, d\alpha, \qquad (2.6.1)$$

where $S_x(t,\omega;g)$ is the spectrogram having $g$ as its window.

Proof: The positivity of $F_x(t,\omega)$ means that $T_{t,\omega}$ is a positive operator and therefore has a square root $A$, i.e.,

$$F_x = (A^*Ax, x) = (Ax, Ax) = \|Ax\|^2, \qquad (A.1)$$



† Bouachache et al. [1978] incorrectly state that a positive and shift-invariant quadratic transform is necessarily a spectrogram. Claasen & Mecklenbräuker [1984] point out this error, mentioning that linear combinations of spectrograms must be included.


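where $\|x(\alpha)\|^2 = \int |x(\alpha)|^2\, d\alpha$ [see Rudin 1973]. Representing the linear operator $A$ in terms of its impulse response, $A_{t,\omega}[x](\alpha) = \int h(\alpha,\tau;t,\omega)\,x(\tau)\,d\tau$, and substituting into Eq. A.1 gives

$$F_x(t,\omega) = \int_{-\infty}^{\infty} \left|\int_{-\infty}^{\infty} h(\alpha,\tau;t,\omega)\, x(\tau)\, d\tau\right|^2 d\alpha. \qquad (A.2)$$

By time and frequency shift-invariance, $F_x(t,\omega) = F_z(0,0)$, where $z(\tau) = x(\tau + t)\,e^{-i\omega\tau}$. Setting $t = \omega = 0$ in Eq. A.2 and applying it to $z$ gives

$$F_x(t,\omega) = \int_{-\infty}^{\infty} \left|\int_{-\infty}^{\infty} h(\alpha,\tau;0,0)\, x(\tau + t)\, e^{-i\omega\tau}\, d\tau\right|^2 d\alpha$$

or, with $g_\alpha(\tau) = h(\alpha,\tau;0,0)$ and a change of variable in the inner integral,

$$F_x(t,\omega) = \int_{-\infty}^{\infty} \left|\int_{-\infty}^{\infty} g_\alpha(\tau - t)\, x(\tau)\, e^{-i\omega\tau}\, d\tau\right|^2 d\alpha.$$

From Eq. 2.4.1, we see the outer integrand is the spectrogram $S_x(t,\omega;g_\alpha)$, giving Eq. 2.6.1. ///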

• Positivity & Superposition: The next theorem shows that positivity implies superposition. In fact, it implies a strong form of superposition, as in Eq. 2.5.6b.

Theorem B. If $F_x(t,\omega)$ is positive, then

$$|F_{x+y}(t,\omega) - [F_x(t,\omega) + F_y(t,\omega)]|^2 \le 4\,F_x(t,\omega)\,F_y(t,\omega). \qquad (2.6.2)$$

Proof: From the elementary fact about inner products

$$\|p+q\|^2 = \|p\|^2 + 2\,\mathrm{Re}(p,q) + \|q\|^2$$

it follows that

$$\left|\,\|p+q\|^2 - [\|p\|^2 + \|q\|^2]\,\right|^2 = 4\,|\mathrm{Re}(p,q)|^2.$$

Since $|(p,q)| \le \|p\|\,\|q\|$,

$$\left|\,\|p+q\|^2 - [\|p\|^2 + \|q\|^2]\,\right|^2 \le 4\,\|p\|^2\|q\|^2.$$

Substituting $p = Ax$ and $q = Ay$ above and using Eq. A.1 gives Eq. 2.6.2. ///
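The bound of Theorem B can be spot-checked numerically. The sketch below builds a positive transform as a sum of two Gaussian-window spectrograms (a simple instance of the superposition in Theorem A) and evaluates both sides of Eq. 2.6.2 for random signals. The window widths, the random test signals, and the function names are assumptions made for the illustration.

```python
import cmath
import math
import random


def stft(x, m, w, sigma):
    """Short-time coefficient sum_n x[n] g[n - m] exp(-i w n), Gaussian g."""
    return sum(x[n] * math.exp(-((n - m) ** 2) / (2.0 * sigma ** 2))
               * cmath.exp(-1j * w * n) for n in range(len(x)))


def positive_transform(x, m, w, sigmas=(1.5, 4.0)):
    """A sum of spectrograms with different windows, hence non-negative."""
    return sum(abs(stft(x, m, w, s)) ** 2 for s in sigmas)


random.seed(0)
N = 24
x = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
y = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
xy = [a + b for a, b in zip(x, y)]

worst_slack = float("inf")       # smallest value of (right side - left side)
for m in range(0, N, 3):
    for k in range(8):
        w = 2.0 * math.pi * k / 8
        Fx, Fy, Fxy = (positive_transform(s, m, w) for s in (x, y, xy))
        lhs = abs(Fxy - (Fx + Fy)) ** 2
        rhs = 4.0 * Fx * Fy
        worst_slack = min(worst_slack, rhs - lhs)
```

The slack never goes negative, so wherever one component's energy is small the cross term is forced to be small as well, which is the strong form of superposition in Eq. 2.5.6b.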

If the transform is real, the converse of this theorem is also true; i.e., superposition implies either $F_x$ or $-F_x$ is positive.

Theorem C. Let $F_x(t,\omega)$ be real and satisfy superposition (Eq. 2.5.6a). Then either $F_x(t,\omega)$ or $-F_x(t,\omega)$ is positive.

Proof: Step 1. First we show that under the hypotheses of the theorem $(Tx,x) = 0 \Rightarrow Tx = 0$. Superposition says

$$(Tx,x)(Ty,y) = 0 \;\Rightarrow\; (T(x+y),x+y) = (Tx,x) + (Ty,y). \qquad (C.1)$$

Since the form $(Tx,x)$ is always real, $(Tx,y) = (Ty,x)^*$, so

$$(T(x+y),x+y) = (Tx,x) + 2\,\mathrm{Re}(Tx,y) + (Ty,y).$$

Thus, from Eq. C.1,

$$(Tx,x)(Ty,y) = 0 \;\Rightarrow\; \mathrm{Re}(Tx,y) = 0. \qquad (C.2)$$

Substituting $ix$ into Eq. C.2 shows that $\mathrm{Im}(Tx,y) = 0$ also, so that

$$(Tx,x)(Ty,y) = 0 \;\Rightarrow\; (Tx,y) = 0. \qquad (C.3)$$

Suppose that $(Tx,x) = 0$. Then by Eq. C.3, $(Tx,y) = 0$ for all $y$. If we let $y = Tx$, then $(Tx,Tx) = 0$ and thus $Tx = 0$, as desired.

Step 2. We now show that $(Tz,z) = 0 \Rightarrow Tz = 0$ implies $\pm T$ is positive. Suppose $(Tx,x) > 0$ and $(Ty,y) < 0$. Let $z = kx + y$ where $k$ is real. Then

$$(Tz,z) = k^2(Tx,x) + 2k\,\mathrm{Re}(Tx,y) + (Ty,y).$$

This is a quadratic in $k$, and since $(Tx,x)(Ty,y) < 0$, it has two distinct real zeroes. However, since $Tx \ne 0$, $Tz = kTx + Ty$ has at most one zero in $k$. Therefore, there exists a value of $k$ such that $(Tz,z) = 0$ but $Tz \ne 0$, contradicting the hypothesis, and implying $\pm T$ is positive. ///

This last theorem shows that we can replace the positivity condition (C2) with the sole requirement that the transform be always real, and have an equivalent set of properties. In other words, the transform will necessarily be positive if superposition holds, and if positivity is abandoned, cross terms will necessarily prove a problem for multi-component signals such as speech.

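• Positivity & Locality: The positivity condition places a limit on the time-frequency locality of the transform. When the transform is positive, it is sometimes convenient to measure locality in terms of the variances of $\phi(t,\omega)$ instead of $|\phi(t,\omega)|^2$. We define

$$\hat\sigma_t^2 = \frac{\iint t^2\,\phi(t,\omega)\, dt\, d\omega}{\iint \phi(t,\omega)\, dt\, d\omega} \quad\text{and}\quad \hat\sigma_\omega^2 = \frac{\iint \omega^2\,\phi(t,\omega)\, dt\, d\omega}{\iint \phi(t,\omega)\, dt\, d\omega}, \qquad (2.6.3)$$

where we assume that the center of mass of $\phi(t,\omega)$ is at the origin.† When the transform is positive, we claim that these variances are non-negative. To show this, first suppose the transform is a spectrogram. Then $\phi(t,\omega)$ is the Wigner distribution of the spectrogram window $g(t)$, and using Eq. 2.4.3, it is easy to see that

$$\hat\sigma_t^2 = \mathrm{var}\,|g(t)|^2 \quad\text{and}\quad \hat\sigma_\omega^2 = \mathrm{var}\,|G(\omega)|^2, \qquad (2.6.4)$$

which are clearly non-negative [cf. De Bruijn]. More generally, if the transform is positive, it follows directly from Theorem A that

$$\hat\sigma_t^2 = \int c_\alpha\, \mathrm{var}\,|g_\alpha(t)|^2\, d\alpha \quad\text{and}\quad \hat\sigma_\omega^2 = \int c_\alpha\, \mathrm{var}\,|G_\alpha(\omega)|^2\, d\alpha, \qquad (2.6.5)$$

where

$$c_\alpha = \frac{\int |g_\alpha(t)|^2\, dt}{\iint |g_\alpha(t)|^2\, dt\, d\alpha}. \qquad (2.6.6)$$

These are again non-negative quantities.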

Eq. 2.6.5 shows that $\hat\sigma_t^2$ is the (weighted) average window variance in the representation of $F_x(t,\omega)$ as a superposition of spectrograms. Since a spectrogram's values at a given time depend only on signal values under its window, we see that a positive transform at a time $t$ effectively depends only on signal values within a few $\hat\sigma_t$ of $t$.‡

† This assumption is necessary for the term 'variance' to apply. It is not necessary, however, for the uncertainty relations presented below to be true [cf. De Bruijn].

‡ This is a stronger notion of time locality than in the previous section. There, time locality essentially measured how the transform spread an impulse. The Wigner distribution is perfectly localized in this sense, because it represents the energy of an impulse at time $t_0$ entirely on the vertical line $t = t_0$ in the time-frequency plane. This does not mean that the Wigner distribution's values at time $t_0$ depend only on the signal value at $t_0$. Quite the opposite is true; they depend on the entire signal. (In fact, the signal can be recovered from the Wigner distribution's values at any fixed time $t_0$ (up to a multiplicative constant) [see Claasen & Mecklenbräuker 1980a].) However, when the transform is positive these two notions of locality coincide.



The next theorem states an important uncertainty relation for positive transforms. It bounds the simultaneous time and frequency resolution that can be obtained by such a transform.

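Theorem D. Let $F_x(t,\omega)$ be positive and shift-invariant. Then $\hat\sigma_t\hat\sigma_\omega \ge \frac12$.

Proof: From Eq. 2.6.5,

$$\hat\sigma_t^2\,\hat\sigma_\omega^2 = \int c_\alpha\,\sigma_\alpha^2\, d\alpha \int c_\alpha\,\varsigma_\alpha^2\, d\alpha,$$

where $\sigma_\alpha^2 = \mathrm{var}\,|g_\alpha(t)|^2$ and $\varsigma_\alpha^2 = \mathrm{var}\,|G_\alpha(\omega)|^2$. By the Schwarz inequality,

$$\hat\sigma_t^2\,\hat\sigma_\omega^2 \ge \left(\int c_\alpha\,\sigma_\alpha\,\varsigma_\alpha\, d\alpha\right)^2.$$

The classical uncertainty relation applied to $g_\alpha(t)$ gives $\sigma_\alpha\varsigma_\alpha \ge \frac12$, so

$$\hat\sigma_t^2\,\hat\sigma_\omega^2 \ge \frac14\left(\int c_\alpha\, d\alpha\right)^2 = \frac14,$$

since $\int c_\alpha\, d\alpha = 1$ from Eq. 2.6.6. Taking square roots yields the desired result. ///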
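The classical uncertainty relation used in the proof can be illustrated for a single window. The sketch below estimates the product of the standard deviations of $|g(t)|^2$ and $|G(\omega)|^2$; the frequency-side variance is computed in the time domain through Parseval's relation (the second moment of $|G|^2$ equals the energy of $g'$, up to the common normalisation). The two windows, the grid, and the function name are assumptions made for the illustration.

```python
import math


def uncertainty_product(g, half_extent=12.0, step=0.01):
    """sigma_t * sigma_w for a real window g centred at the origin.

    sigma_t^2 is the variance of |g(t)|^2.  sigma_w^2 is the variance of
    |G(w)|^2, evaluated in the time domain through Parseval's relation:
    (integral of w^2 |G|^2) / (integral of |G|^2)
        = (integral of |g'|^2) / (integral of |g|^2).
    """
    n = int(round(half_extent / step))
    energy = t2 = d2 = 0.0
    for i in range(-n, n + 1):
        t = i * step
        val = g(t)
        deriv = (g(t + step) - g(t - step)) / (2.0 * step)  # central difference
        energy += val * val
        t2 += t * t * val * val
        d2 += deriv * deriv
    return math.sqrt((t2 / energy) * (d2 / energy))


gauss = uncertainty_product(lambda t: math.exp(-t * t / 2.0))
sech = uncertainty_product(lambda t: 1.0 / math.cosh(t))
```

Up to discretisation error the Gaussian window attains the bound of one half, while the hyperbolic-secant window gives a slightly larger product, consistent with the Gaussian being the extremal case.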

• Locality & Smoothness: Just as in the stationary case, locality and smoothness are conflicting properties. Greater smoothness means poorer locality and vice versa, other things being equal. This follows formally from a two-dimensional generalization of the classical uncertainty relation.

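Theorem E. If $F_x(t,\omega)$ is shift-invariant, then $\sigma_t\Sigma_\nu \ge \frac12$ and $\sigma_\omega\Sigma_\tau \ge \frac12$, with equality in both these relations iff $\phi(t,\omega)$ is a two-dimensional gaussian with no cross term,

$$\phi(t,\omega) = c\,e^{-(a t^2 + b\,\omega^2)}, \qquad a, b > 0. \qquad (2.6.7)$$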

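Proof: First, we show that $\sigma_t\Sigma_\nu \ge \frac12$. Let $\lambda(t,\tau) = \frac{1}{2\pi}\int \phi(t,\omega)\,e^{i\omega\tau}\,d\omega$. Then $\Phi(\tau,\nu) = \mathcal{F}[\phi(t,\omega)] = \int \lambda(t,\tau)\,e^{-i\nu t}\,dt$. Applying the classical uncertainty relation to $\lambda(t,\tau)$ w.r.t. $t$ gives

$$\frac14\left(\int_{-\infty}^{\infty} |\lambda(t,\tau)|^2\, dt\right)^2 \le \left(\int_{-\infty}^{\infty} t^2\,|\lambda(t,\tau)|^2\, dt\right)\left(\frac{1}{2\pi}\int_{-\infty}^{\infty} \nu^2\,|\Phi(\tau,\nu)|^2\, d\nu\right). \qquad (E.1)$$

Taking the square root of E.1, integrating over $\tau$, and using the Schwarz inequality,

$$\frac12\iint |\lambda(t,\tau)|^2\, dt\, d\tau \le \int\left(\int t^2\,|\lambda(t,\tau)|^2\, dt\right)^{1/2}\left(\frac{1}{2\pi}\int \nu^2\,|\Phi(\tau,\nu)|^2\, d\nu\right)^{1/2} d\tau$$

$$\le \left(\iint t^2\,|\lambda(t,\tau)|^2\, dt\, d\tau\right)^{1/2}\left(\frac{1}{2\pi}\iint \nu^2\,|\Phi(\tau,\nu)|^2\, d\nu\, d\tau\right)^{1/2}. \qquad (E.2)$$

By Parseval's theorem,

$$\int_{-\infty}^{\infty} |\lambda(t,\tau)|^2\, d\tau = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\phi(t,\omega)|^2\, d\omega \qquad (E.3a)$$

and

$$\int_{-\infty}^{\infty} |\lambda(t,\tau)|^2\, dt = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\Phi(\tau,\nu)|^2\, d\nu. \qquad (E.3b)$$

Substituting Eq. E.3 into Eq. E.2 yields

$$\frac12\iint |\phi(t,\omega)|^2\, dt\, d\omega \le \left(\iint t^2\,|\phi(t,\omega)|^2\, dt\, d\omega\right)^{1/2}\left(\iint \nu^2\,|\Phi(\tau,\nu)|^2\, d\tau\, d\nu\right)^{1/2}. \qquad (E.4)$$

Since $\iint |\phi(t,\omega)|^2\, dt\, d\omega = \iint |\Phi(\tau,\nu)|^2\, d\tau\, d\nu$, we have $\frac14 \le \sigma_t^2\Sigma_\nu^2$. By similar reasoning, $\frac14 \le \sigma_\omega^2\Sigma_\tau^2$.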



Direct computation of the variances shows that if $\phi(t,\omega)$ is a 2-D gaussian (Eq. 2.6.7), then these inequalities are satisfied with equality. Showing the converse is somewhat more involved. If these inequalities are satisfied with equality, then from the classical uncertainty relation and the proof above, it follows that $\Phi(\tau,\nu)$ is gaussian in each of its variables. In other words,

$$\Phi(\tau,\nu) = e^{-[a(\nu)\tau^2 + b(\nu)]} = e^{-[c(\tau)\nu^2 + d(\tau)]} \qquad (E.5)$$

for all $\tau$ and $\nu$, where $a > 0$ and $c > 0$. Thus, $a(\nu)\tau^2 + b(\nu) = c(\tau)\nu^2 + d(\tau)$. Setting $\tau = 0$ and $\nu = 0$ shows that $b(\nu) = c(0)\nu^2 + d(0)$ and $d(\tau) = a(0)\tau^2 + b(0)$, respectively, so

$$a(\nu)\tau^2 + c(0)\nu^2 + d(0) = c(\tau)\nu^2 + a(0)\tau^2 + b(0). \qquad (E.6)$$

Twice differentiating this w.r.t. $\tau$ and $\nu$ gives $a''(\nu) = c''(\tau)$ for all $\tau$ and $\nu$; thus they are constant. Taylor expanding $a(\nu)$ and $c(\tau)$, substituting into Eq. E.6, and equating terms shows that

$$\Phi(\tau,\nu) = e^{-[a(0)\tau^2 + c(0)\nu^2 + \gamma\tau^2\nu^2 + b(0)]}, \qquad \gamma = \tfrac12 a''(0). \qquad (E.7)$$

By the symmetry of the two domains, $\phi(t,\omega)$ must have the same form, with a cross coefficient $\eta$ in place of $\gamma$. Together, these imply an explicit expression for $\lambda(t,\tau)$ (Eq. E.8) that holds for all $t$ and $\tau$. Taking the logarithm of Eq. E.8, clearing of fractions, and equating terms shows that $\gamma = \eta = 0$. Thus, $a''(0) = 0$ in Eq. E.7, which implies Eq. 2.6.7, as desired. ///



2.7. Satisfying the design criteria: the Gaussian transform

From the last theorem, we see that a two-dimensional gaussian transform kernel gives the best time-frequency locality for a given smoothness. The resulting representation will be called the Gaussian transform of the signal.† By specifying $\hat\sigma_t^2$ ($= 2\sigma_t^2$) and $\hat\sigma_\omega^2$ ($= 2\sigma_\omega^2$) for this kernel we are, in effect, selecting a particular time and frequency scale for the transform. We may choose any values we wish provided $\hat\sigma_t\hat\sigma_\omega \ge \frac12$ (positivity), and the resulting transform will best satisfy all our design properties. The result is clearly a generalization of the solution in the stationary case, where gaussian convolution kernels of different sizes selected different spectral scales.

When $\hat\sigma_t\hat\sigma_\omega = \frac12$, this transform is equivalent to a spectrogram using a gaussian window. For larger values of $\hat\sigma_t\hat\sigma_\omega$, this transform is equivalent to convolving such a spectrogram with a 2-D gaussian.

As a note on its implementation, this last fact was used to compute the figures below. A more direct method would be to compute the Wigner distribution and then perform the 2-D convolution specified in Eq. 2.5.2. This is not very efficient in a digital implementation, however, since the Wigner distribution has to be computed at high sampling rates to avoid aliasing.‡

By performing a convolution on a spectrogram, far fewer time and frequency samples need to be computed, since the spectrogram is already a smoothed version of the Wigner distribution. Further, since the gaussian kernel is uncorrelated in time and frequency, the 2-D convolution is separable, and can be performed as separate 1-D convolutions in the time and frequency directions, resulting in a relatively inexpensive computation.

† We have chosen this name for obvious reasons. This risks, however, confusion with the Gauss-Weierstrass transformation [see Hille 1948]. In fact, the Gaussian transform of the signal $x(t)$ is the two-dimensional Gauss-Weierstrass transformation of the Wigner distribution $W_x(t,\omega)$ [see De Bruijn 1967].

‡ In general, the Wigner distribution must be sampled in time at twice the Nyquist rate of the signal [Claasen & Mecklenbräuker 1980b].
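The separability remark can be made concrete with a small sketch. Below, a sampled time-frequency array is smoothed with two 1-D kernels in succession and compared against direct 2-D convolution with their outer product. The zero-padded boundary handling, the toy input array, the Gaussian-like kernel weights, and the function names are assumptions made for the illustration, not the procedure used for the figures.

```python
def convolve_axis(S, kernel, axis):
    """'Same'-size 1-D convolution of a 2-D list along axis 0 (time) or 1 (frequency).

    Values outside the array are taken as zero (an assumption for this sketch).
    """
    rows, cols = len(S), len(S[0])
    c = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for a, ka in enumerate(kernel):
                ii, jj = (i - a + c, j) if axis == 0 else (i, j - a + c)
                if 0 <= ii < rows and 0 <= jj < cols:
                    acc += ka * S[ii][jj]
            out[i][j] = acc
    return out


def smooth_separable(S, k_time, k_freq):
    """Smooth along time, then along frequency, with two 1-D kernels."""
    return convolve_axis(convolve_axis(S, k_time, 0), k_freq, 1)


def smooth_direct(S, k_time, k_freq):
    """Reference: direct 2-D convolution with the outer-product kernel."""
    rows, cols = len(S), len(S[0])
    ct, cf = len(k_time) // 2, len(k_freq) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for a, ka in enumerate(k_time):
                for b, kb in enumerate(k_freq):
                    ii, jj = i - a + ct, j - b + cf
                    if 0 <= ii < rows and 0 <= jj < cols:
                        out[i][j] += ka * kb * S[ii][jj]
    return out


# Toy "spectrogram" (rows = time, columns = frequency) and illustrative kernels.
S = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(5)]
k_t = [0.25, 0.5, 0.25]
k_f = [0.1, 0.2, 0.4, 0.2, 0.1]
A = smooth_separable(S, k_t, k_f)
B = smooth_direct(S, k_t, k_f)
max_diff = max(abs(A[i][j] - B[i][j]) for i in range(5) for j in range(6))
```

The two results agree to round-off, but the separable version needs only len(k_t) + len(k_f) multiplications per output sample instead of len(k_t) * len(k_f), which is where the saving comes from.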

2.8. Directional time-frequency transforms

So far, we have assumed that the time-frequency energy representation was non-directional in the sense that the covariances $\sigma_{t\omega}$ and $\Sigma_{\tau\nu}$ of the transform kernel were both zero. We shall now examine the consequences of lifting this condition. We begin with an example. Consider the two transforms specified by the kernels

$$\phi_1(t,\omega) = e^{-(t^2 - t\omega + \omega^2)}$$

and

$$\phi_2(t,\omega) = e^{-(t^2 + t\omega + \omega^2)}.$$

These transforms have identical $\sigma_t$ and $\sigma_\omega$, but differ in the sign of $\sigma_{t\omega}$. Figure 2.10 shows their concentration ellipses, and Figure 2.11 gives the transform of the chirp $e^{it^2/2}$ for these two cases. Notice that the second transform broadens the chirp much more than the first, which should be evident from the concentration ellipses. The opposite would be true for the chirp $e^{-it^2/2}$. These transforms are directionally sensitive, and using $\sigma_t$ and $\sigma_\omega$ as the sole measures of time-frequency resolution is obviously inadequate in such cases.

Why consider transforms with such behavior? One answer is to provide a general treatment of time-frequency locality. Another answer is that it is evidently possible to obtain better time-frequency resolution for some signals if the transform is directionally 'tuned' to them than otherwise. This would mean that, in general, we would need a family of transforms each tuned to a preferred time-frequency orientation.

Figure 2.10. Concentration ellipses for transform kernels with complementary orientation selectivity. (a) Concentration ellipse for $\phi_1(t,\omega) = e^{-(t^2 - t\omega + \omega^2)}$. (b) Concentration ellipse for $\phi_2(t,\omega) = e^{-(t^2 + t\omega + \omega^2)}$.

Figure 2.11. Directional transforms for a linear chirp $e^{it^2/2}$. (a) Transform has the kernel in Figure 2.10a. (b) Transform has the kernel in Figure 2.10b. The second transform broadens this chirp much more than the first, which should be evident from their concentration ellipses.

The theory of directional transforms is greatly simplified by a rotation of co-ordinates. Let

$$R_\theta = \begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix} \qquad (2.8.1)$$

be the operator that rotates a point $\theta$ radians in the time-frequency plane. Given a time-frequency representation $F_x(t,\omega)$ of a signal $x(t)$, we can consider the rotated representation formed by the composition $F_x R_\theta(t,\omega)$. Is this the time-frequency representation of an actual signal? The answer is yes; if

$$x_\theta(t) = \frac{1}{2\pi\sqrt{\cos\theta}}\, e^{-i t^2 \tan\theta/2} \int_{-\infty}^{\infty} X(\omega)\, e^{i\left[-\frac{\omega^2 \tan\theta}{2} + \frac{\omega t}{\cos\theta}\right]}\, d\omega, \qquad (2.8.2)$$

then $W_{x_\theta} = W_x R_\theta$ (see Van Trees 1971). So if $F_x$ has the kernel $\phi(t,\omega)$ and if $G_x$ has the kernel $\phi R_\theta(t,\omega)$, then $G_{x_\theta} = F_x R_\theta$. In other words, Eq. 2.8.2 rotates the signal by $\theta$ radians in time-frequency; thus the transform with the rotated kernel applied to this signal will give the desired effect.

Relative to these new co-ordinates we can generalize some of the measures of the previous sections. For example, consider

$$\pi_\theta(t) = \int_{-\infty}^{\infty} F_x R_\theta(t,\omega)\, d\omega. \qquad (2.8.3)$$

This is the marginal of the rotated transform along $\omega$. It follows that the time and frequency marginals (Eq. 2.4.2) of $F_x(t,\omega)$ satisfy $\pi_t = \pi_{\theta=0}$ and $\pi_\omega = \pi_{\theta=\pi/2}$.

If $\pi_\theta(t) = |x_\theta(t)|^2$, then we will say that the transform preserves the marginal relative to the direction $\theta$ in time-frequency. Interestingly, the Wigner distribution uniquely meets this requirement for all $\theta$. The proof is a simple generalization of Cohen's result. He showed that a shift-invariant quadratic transform preserves the time marginal, i.e., $\pi_t(t) = |x(t)|^2$, iff $\Phi(\tau,0) = 1$ for all $\tau$. Using $\mathcal{F}[\phi R_\theta] = \Phi R_\theta$, which is easily verified, it follows that $\pi_\theta(t) = |x_\theta(t)|^2$ iff $\Phi R_\theta(\tau,0) = 1$ for all $\tau$. This implies that $\Phi(\tau,\nu) = 1$, which corresponds to the Wigner distribution by Eq. 2.5.3. This is the reason for considering the Wigner distribution 'perfectly localized' and $\phi(t,\omega)$ the 'point spread function' in time-frequency.

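The amount of spread in time-frequency direction $\theta$ can be measured by the variance

$$\sigma_\theta^2 = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} t^2\, |\phi R_\theta(t,\omega)|^2\, dt\, d\omega. \qquad (2.8.4)$$

In the notation of the previous sections, $\sigma_t = \sigma_{\theta=0}$, $\sigma_\omega = \sigma_{\theta=\pi/2}$, and

$$\sigma_\theta^2 = \begin{pmatrix}\cos\theta & \sin\theta\end{pmatrix} \begin{pmatrix}\sigma_t^2 & \sigma_{t\omega}\\ \sigma_{t\omega} & \sigma_\omega^2\end{pmatrix} \begin{pmatrix}\cos\theta\\ \sin\theta\end{pmatrix}. \qquad (2.8.5)$$

Let $\sigma_1^2$ be the maximum value and $\sigma_2^2$ be the minimum value of $\sigma_\theta^2$, which correspond to the eigenvalues of the covariance matrix in Eq. 2.3.5. Further, let $\theta^*$ be the maximum direction, i.e., the direction of the eigenvector of the eigenvalue $\sigma_1^2$. In other words, $\sigma_1$ and $\sigma_2$ are the maximum and minimum dimensions of the concentration ellipse of $\phi(t,\omega)$, and $\theta^*$ is the angle of the major axis of the concentration ellipse relative to the time axis. These three quantities conveniently specify the time-frequency locality of the transform.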
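As a concrete illustration of these locality measures, the sketch below evaluates the directional variance as a quadratic form in a 2×2 covariance matrix and reads $\sigma_1$, $\sigma_2$ and $\theta^*$ from its eigen-decomposition. The function names are illustrative, and the example covariance is an assumption worked out by hand for $|\phi_1|^2$ with $\phi_1 = e^{-(t^2 - t\omega + \omega^2)}$ (the kernel of Figure 2.10a); it is not quoted from the text.

```python
import numpy as np

def directional_variance(cov, theta):
    """Quadratic form (cos, sin) C (cos, sin)^T: the variance in direction theta."""
    c = np.array([np.cos(theta), np.sin(theta)])
    return float(c @ cov @ c)

def principal_axes(cov):
    """Return (sigma_1, sigma_2, theta_star) from a symmetric 2x2 covariance matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    major = eigvecs[:, 1]                       # eigenvector of the largest eigenvalue
    theta_star = np.arctan2(major[1], major[0]) % np.pi   # axis angle in [0, pi)
    return float(np.sqrt(eigvals[1])), float(np.sqrt(eigvals[0])), float(theta_star)

# Covariance of |phi_1|^2 for phi_1 = exp(-(t^2 - t*w + w^2)) (hand-derived assumption).
cov = np.array([[1.0 / 3.0, 1.0 / 6.0],
                [1.0 / 6.0, 1.0 / 3.0]])

sigma_1, sigma_2, theta_star = principal_axes(cov)
print(sigma_1, sigma_2, np.degrees(theta_star))
```

For this kernel the major axis of the concentration ellipse lies at 45 degrees to the time axis, which is the orientation of the chirp $e^{it^2/2}$ in the example above.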

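In an analogous manner, we can measure the smoothness of the transform in time-frequency direction $\theta$ by

$$\Sigma_\theta^2 = \text{[right-hand side illegible in the source]}. \qquad (2.8.6)$$

In the notation of the previous sections, $\Sigma_t = \Sigma_{\theta=0}$, $\Sigma_\omega = \Sigma_{\theta=\pi/2}$, and

$$\Sigma_\theta^2 = \begin{pmatrix}\cos\theta & \sin\theta\end{pmatrix} \begin{pmatrix}\Sigma_t^2 & \Sigma_{t\omega}\\ \Sigma_{t\omega} & \Sigma_\omega^2\end{pmatrix} \begin{pmatrix}\cos\theta\\ \sin\theta\end{pmatrix}. \qquad (2.8.7)$$

Let $\Sigma_1^2$ be the maximum value and $\Sigma_2^2$ be the minimum value of $\Sigma_\theta^2$, and let $\Theta^*$ be the maximum direction. These three quantities conveniently specify the time-frequency smoothness of the transform.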

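We are now in a position to generalize Theorem E.

Theorem F. If $F_x(t,\omega)$ is shift-invariant, then $\sigma_1\Sigma_2 \ge \tfrac{1}{2}$ and $\sigma_2\Sigma_1 \ge \tfrac{1}{2}$, with equality in both these relations iff

$$\text{[Eq. 2.8.8 is missing from the source]} \qquad (2.8.8)$$

Proof: Applying Theorem E to the transform with kernel $\phi R_{\theta^*}$, we have $\tfrac{1}{2} \le [\ldots] \le \sigma_2\Sigma_1$. Similarly, with the kernel $\phi R_{\Theta^*}$, $\tfrac{1}{2} \le [\ldots] \le \sigma_1\Sigma_2$. The right-hand inequalities are satisfied with equality iff $\Theta^* = [\ldots]$. It follows from Theorem E that Eq. 2.8.8 is a necessary and sufficient condition that all these inequalities are satisfied with equality. ///

Generalizing Theorem D requires that we use the directional variance of $\phi(t,\omega)$, not $|\phi(t,\omega)|^2$:

$$\tilde\sigma_\theta^2 = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} t^2\, \phi R_\theta(t,\omega)\, dt\, d\omega. \qquad (2.8.9)$$

We define $\tilde\sigma_1^2$ and $\tilde\sigma_2^2$ as the maximum and minimum values of this variance, and $\tilde\theta^*$ as the maximum direction.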



Theorem G. Let $F_x(t,\omega)$ be positive and shift-invariant. Then $\tilde\sigma_1\tilde\sigma_2 \ge \tfrac{1}{2}$.

Proof: Apply Theorem D to the signal $x_{-\tilde\theta^*}(t)$ and the transform with kernel $\phi R_{\tilde\theta^*}$. ///

Corollary. If $F_x(t,\omega)$ is positive and shift-invariant, then

$$\begin{vmatrix}\tilde\sigma_t^2 & \tilde\sigma_{t\omega}\\ \tilde\sigma_{t\omega} & \tilde\sigma_\omega^2\end{vmatrix} \ge \tfrac{1}{4}.$$

From Theorem F, we see that a two-dimensional gaussian transform kernel gives the best time-frequency locality for a given smoothness. In this general case, however, the gaussian kernel may be correlated in time and frequency, i.e., its concentration ellipse may be oriented obliquely in the time-frequency plane. By specifying $\tilde\sigma_1^2$ $(= 2\sigma_1^2)$, $\tilde\sigma_2^2$ $(= 2\sigma_2^2)$, and $\theta^*$ for this kernel we are, in effect, selecting a particular time-frequency scale for the transform. By Theorem G, we may choose any values we wish provided $\tilde\sigma_1\tilde\sigma_2 \ge \tfrac{1}{2}$, and the resulting transform will best satisfy all our design properties.

When $\tilde\sigma_1\tilde\sigma_2 = \tfrac{1}{2}$, this transform is equivalent to a spectrogram with a rotated gaussian window $g_{\theta^*}(t)$ [cf. Biley 1963, Dungeon 1984]. For larger values of $\tilde\sigma_1\tilde\sigma_2$, this transform is equivalent to convolving such a spectrogram with a 2-D gaussian.

2.9. A speech example 

In this section we examine a particular utterance, comparing the various signal representations discussed above. The utterance is /wioi/ taken from "We owe Eve a dollar", as produced by an adult male. This utterance has some rapid F2 motion, which makes it useful as an example of non-stationary behavior in speech.



Figure 2.12. Log magnitude spectrograms of the utterance /wioi/. (a) Wideband (gaussian window standard deviation of 1 msec). (b) Narrowband (standard deviation of 15 msec). (c) Intermediate band (standard deviation of 4 msec).

Figures 2.12a,b show the traditional wideband and narrowband spectrograms for this utterance. These are spectrograms computed with gaussian windows of standard deviation 1 msec and 15 msec, respectively. The wideband spectrogram shows vertical striations spaced at the pitch period. The narrowband spectrogram shows horizontal striations spaced at the fundamental frequency. They are both due to the voiced excitation. Figure 2.12c shows a spectrogram whose window duration is 4 msec, which is intermediate between the previous two. This window size is matched to the excitation in the following sense. The 2-D gaussian kernel (Eq. 2.6.7) that corresponds to this spectrogram has standard deviations of 2 msec by 20 Hz. These are in the same ratio as 10 msec and 100 Hz, the pitch period and the fundamental frequency, respectively. This choice gives rise to rows and columns of sharp peaks and valleys spaced at the pitch period and the fundamental frequency. We will see in the next chapter why the excitation produces this particular structure.
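The matching condition can be worked through numerically. The sketch below is illustrative and rests on two assumptions that are not stated in this form in the text: the window is $g(t) = e^{-t^2/2\sigma^2}$, and the kernel's standard deviations are measured on the squared kernel, which gives $\sigma/2$ in time and $1/(4\pi\sigma)$ Hz in frequency. These closed forms were derived for the example; they do reproduce the 2 msec by 20 Hz figures quoted above. The function names are hypothetical.

```python
import math

def kernel_scales(window_std):
    """Time (s) and frequency (Hz) standard deviations of the spectrogram kernel.

    Assumes a gaussian window exp(-t^2 / (2*window_std^2)) and that the spreads
    are measured on the squared kernel (an assumption made for this sketch).
    """
    sigma_t = window_std / 2.0
    sigma_f = 1.0 / (4.0 * math.pi * window_std)
    return sigma_t, sigma_f

def matched_window_std(pitch_period):
    """Window std for which sigma_t / sigma_f equals pitch_period / f0.

    With f0 = 1 / pitch_period the condition reads 2*pi*sigma^2 = pitch_period^2.
    """
    return pitch_period / math.sqrt(2.0 * math.pi)

sigma_t, sigma_f = kernel_scales(0.004)       # the 4 msec window of Figure 2.12c
window_std = matched_window_std(0.010)        # 10 msec pitch period
print(sigma_t, sigma_f, window_std)
```

For a 10 msec pitch period the matched window comes out very close to 4 msec, consistent with the intermediate-band spectrogram.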

Figure 2.13 shows the Wigner distribution for this utterance. Compared to Figure 2.12 it looks almost as if the vertical scale has changed, but it has not. This representation is dominated by cross-terms that give 'echoes' of the formants in initially surprising places. But remember that the sum of two complex exponentials at different frequencies gave rise to a cross-term half-way between them that had greater amplitude than the original terms (Figure 2.&). Evidently, the Wigner distribution itself gives a confusing picture of multi-component signals such as speech.

Figure 2.14 shows the time-frequency autocorrelation function, the 2-D Fourier transform of the Wigner distribution, for this utterance in the neighborhood of the origin. Notice the repeated pattern in rows and columns spaced at the pitch period and the fundamental frequency. In Chapter 3 we will see that this pattern can be exploited in understanding how to suppress the excitation.



Figure 2.13. Log magnitude of Wigner distribution. (This is implemented as a pseudo-Wigner distribution using a gaussian window of standard deviation 4W msec [see Claasen & Mecklenbräuker 1980b].)

Figure 2.14. Log magnitude of time-frequency autocorrelation function in the vicinity of the origin.

Figure 2.15. Gaussian transform with kernel scales chosen to suppress the excitation, $\sigma_t = 10$ msec and $\sigma_\omega = 100$ Hz. (a) 2-D plot. (b) 3-D plot.

Figure 2.15 shows the Gaussian transform of this signal using a kernel of a scale chosen to suppress the excitation. The pitch striations are removed, leaving smooth time-frequency ridges that correspond to the formants. The ridges are quite sharp, although it is somewhat difficult to appreciate this in the half-toned picture, Figure 2.15a. The 3-D plot in Figure 2.15b gives a different perspective on this surface. It shows F1 and parts of F2 quite nicely, although most everything above % kHz is considerably distorted in this presentation.

Finally, Figure 2.16 shows directional transforms of this utterance using oriented Gaussian kernels matched to different aspects of the signal. In Figure 2.16a, the kernel orientation is matched to the rising F2. In Figure 2.16b, the kernel orientation is matched to the falling F2. These choices bring out the selected formant peak with high resolution.

In this chapter, we have found that a particular time-frequency energy representation, the Gaussian transform, best satisfies a set of properties deemed desirable. There are several free parameters for this representation ($\sigma_1$, $\sigma_2$, and $\theta^*$), which determine the scale and directional selectivity of the transform. Deciding what scales are of interest requires a more specific model of the signal. In the next chapter, we adopt such a model.



Figure 2.16. Directional transforms using oriented Gaussian kernels matched to different aspects of the signal. (a) Kernel orientation matched to rising F2. (b) Kernel orientation matched to falling F2.



Chapter 3. 
Time-frequency filtering 



In this chapter, we continue the discussion of joint time-frequency energy representations for speech signals. Here we shall make stronger assumptions about the form of the signals. We will introduce a particular model of the time-varying vocal tract, and define its 'transfer function', $H(t,\omega)$. We will show that time-frequency filtering can be used to estimate $|H(t,\omega)|^2$, a technique that is essentially a two-dimensional generalization of straightforward, stationary methods. Further, we will see that $|H(t,\omega)|^2$ is closely related to the time-frequency representations of the previous chapter.

3.1. The stationary case

First, let us re-examine the stationary case. If we adopt a more detailed model of the generation of a stationary speech signal, we can say much more about the cepstral methods discussed in the previous chapter. The linear model [Fant 1960; Flanagan 1972] of vowel production begins by decomposing the speech signal into a vocal source component (e.g., periodic vocal fold vibration) and a vocal tract component, which are treated as independent. The vocal tract is modelled as a linear and quasi-time-invariant filter with excess pressure and volume velocity (of assumed one-dimensional wave motion) being analogous to voltage and current in circuit theory. The distribution of the poles of the filter's system function constitutes the formant description of the vocal tract.

In other words, $H(i\omega)$, the transfer function of the stationary vocal tract, can be approximated by [Flanagan 1972],†

$$H(i\omega) = \sum_{n=1}^{N}\left[z_n H_n(i\omega) + z_n^* H_{-n}(i\omega)\right], \qquad (3.1.1)$$

where $H_n(s)$ consists of a simple pole at $s_n = \sigma_n + i\omega_n$,

$$H_n(i\omega) = \frac{1}{i\omega - [\sigma_n + i\omega_n]}, \qquad (3.1.2)$$

and $z_n$ is the residue at the $n$th pole,

$$z_n = \text{[illegible in the source]}. \qquad (3.1.3)$$

We associate a formant with each pole, or more precisely, with each pair of poles, since they occur in conjugate pairs, i.e., $s_{-n} = s_n^*$, given the impulse response of the vocal tract is real. The impulse response of the stationary vocal tract is, in fact,

$$h(t) = \sum_{n=1}^{N}\left[z_n h_n(t) + z_n^* h_{-n}(t)\right], \qquad (3.1.4)$$

where

$$h_n(t) = e^{s_n t}\,u(t). \qquad (3.1.5)$$

In this linear time-invariant model, it follows that the spectrum of the excitation and the vocal tract transfer function combine by multiplication in the power spectrum and addition in the log spectrum. This fact leads to a simple procedure for separating the excitation and the vocal tract transfer function in certain (idealized) cases.

† This is the parallel formulation. The serial formulation, $H(i\omega) = k\prod_n H_n(i\omega)H_{-n}(i\omega)$, is also often used. The former is the partial fraction expansion of the latter.



Suppose that the excitation is an impulse train, which is a very simple model of constant pitch, voiced excitation. In this case, the spectrum of the excitation is also an impulse train, and thus the speech spectrum is a uniformly sampled version of the vocal tract transfer function. If the sampling were unaliased (i.e., the pitch is low enough relative to the highest transfer function 'quefrencies'), the original transfer function can be exactly recovered by ideal low-pass filtering the spectrum, by the sampling theorem [Bracewell 1978]. But this is just cepstral smoothing using, in this very idealized case, a rectangular cepstral window [Oppenheim 1969; Oppenheim & Schafer 1975].

Let us examine this result more closely. The formulation here will be in terms of the power spectrum and its transform, the autocorrelation function, instead of the more usual log spectrum and its transform, the cepstrum, since the former generalizes more easily to the time-varying case. Since the term 'cepstral filtering' is, strictly speaking, reserved for filtering operations on the log magnitude spectrum, we shall refer to analogous operations on the power spectrum as autocorrelation filtering. The results in the stationary case are similar in either formulation.†

If $x(t)$ represents the excitation, $h(t)$ the impulse response of the vocal tract, and $y(t)$ the output speech signal, then in terms of power spectra and transfer function, $|Y(\omega)|^2 = |H(i\omega)|^2\,|X(\omega)|^2$, or in terms of autocorrelation functions,

$$A_y(\tau) = \int_{-\infty}^{\infty} A_x(t)\,A_h(\tau - t)\,dt. \qquad (3.1.6)$$

Let the excitation be an impulse train, $I(t;T) = \sum_k \delta(t - kT)$. Then

$$A_x(\tau) = \frac{1}{T}\sum_{k=-\infty}^{\infty} \delta(\tau - kT). \qquad (3.1.7)$$

Thus from Eq. 3.1.6, we have

$$A_y(\tau) = \frac{1}{T}\sum_{k} A_h(\tau - kT). \qquad (3.1.8)$$

Provided the duration of $A_h(\tau)$ is small enough that the terms in Eq. 3.1.8 do not overlap, $A_h(\tau)$ and thus $|H(i\omega)|^2$ can be recovered by windowing $A_y(\tau)$ with a rectangular window centered on the origin and of duration $T$ (see Figure 3.1).

† Cepstral and autocorrelation filtering can both be used to separate signal components that arise at different scales in the frequency domain. Cepstral filtering is most appropriate when the signal components combine by convolution in the time domain, autocorrelation filtering when they combine by addition. Both approaches can be used for speech, since we can use either a serial or parallel formulation of the vocal tract model.
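The procedure can be tried on a discrete grid. The sketch below is a minimal illustration, not the implementation used for Figure 3.1: it samples $|H|^2$ of a single pole at the harmonics of a 100 Hz fundamental, transforms to the autocorrelation domain, applies a rectangular window of one pitch period, and transforms back. The sampling rate, grid size and pole centre frequency are assumed values, only the positive-frequency pole of Eq. 3.1.5 is modelled, and the 300 Hz bandwidth and 10 msec period follow the figure caption.

```python
import numpy as np

fs = 8000.0              # sampling rate in Hz (assumed)
N = 8000                 # number of frequency bins, so the bin spacing is 1 Hz
f0 = 100.0               # fundamental frequency, i.e. a 10 msec pitch period
fc, bw = 1000.0, 300.0   # pole centre frequency (assumed) and half-power bandwidth

freqs = np.fft.fftfreq(N, d=1.0 / fs)                    # signed frequency of each bin
alpha = np.pi * bw                                       # decay rate of the pole in 1/s
H2 = 1.0 / ((2 * np.pi * (freqs - fc))**2 + alpha**2)    # |H|^2 of a single pole

step = int(round(f0 * N / fs))       # number of bins between harmonics
comb = np.zeros(N)
comb[::step] = 1.0                   # excitation spectrum: an impulse train
power = H2 * comb                    # output power spectrum: sampled |H|^2

acf = np.fft.ifft(power)             # autocorrelation function of the output
lag = np.fft.fftfreq(N, d=1.0 / N)   # signed lag index of each sample
period = N // step                   # pitch period measured in lag samples
window = np.abs(lag) < period / 2    # rectangular window of duration T

# Transform back; the factor `step` undoes the 1/step density of the comb.
recovered = step * np.fft.fft(acf * window).real
error = np.max(np.abs(recovered - H2)) / H2.max()
print(f"maximum error relative to the peak: {error:.3f}")
```

With these values the replicas of $A_h(\tau)$ overlap only slightly, so the windowed estimate should stay within a few percent of the true $|H|^2$; narrowing the bandwidth increases the overlap and the error.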

Let us examine the form of $A_h(\tau)$. Assume for now that the vocal tract transfer function consists of only a single pole, i.e., its impulse response has the form of Eq. 3.1.5. Then

$$A_h(\tau) = \int_{-\infty}^{\infty} h_n(t+\tau)\,h_n^*(t)\,dt = \frac{1}{B_n}\,e^{-B_n|\tau|/2 + i\omega_n\tau}, \qquad (3.1.9)$$

where $B_n = -2\sigma_n$ is the (half-power) bandwidth of the pole. Thus, provided this bandwidth is large enough, the overlap in the terms in Eq. 3.1.8 will be negligible, and windowing $A_y(\tau)$ will very nearly recover $A_h(\tau)$ and hence $|H(i\omega)|^2$.†

† The phase of the transfer function can be found, if desired, from its magnitude, since this model is minimum phase (see Oppenheim & Schafer 1975).


Figure 3.1. Recovering the transfer function by autocorrelation filtering. (a) Spectrum of the excitation modelled as an impulse train (10 msec period). (b) Square magnitude of the transfer function, which in this simple example is a single pole of 300 Hz bandwidth. (c) Power spectrum, the product of '(a)' and '(b)'. Cepstral filtering uses the log spectrum instead. The approach here generalizes more easily to the time-varying case. (d) Magnitude of the autocorrelation function, the (inverse) Fourier transform of '(c)'. Dashed lines show the rectangular window. (e) Fourier transform of the windowed autocorrelation function, which very nearly recovers the transfer function '(b)' in this idealized case (the effect of the slight overlap of the terms in '(d)' is negligible).



The analysis of the multiple pole case follows from superposition. Provided the poles are not closely spaced relative to their bandwidths,†

$$|H(i\omega)|^2 \approx \sum_{n=1}^{N} |z_n|^2\left[|H_n(i\omega)|^2 + |H_{-n}(i\omega)|^2\right] \qquad (3.1.11)$$

from Eq. 3.1.1 and Eq. 3.1.2, hence

$$A_h(\tau) \approx \sum_{n=1}^{N} \frac{2|z_n|^2}{B_n}\,e^{-B_n|\tau|/2}\cos\omega_n\tau \qquad (3.1.12)$$

from Eq. 3.1.9. From this equation and Eq. 3.1.8, we see that windowing the autocorrelation function of the output speech signal can still be used to recover the transfer function when the bandwidths are large enough that aliasing is negligible.

A few changes to this model make it more realistic. First, the spectrum of constant voiced excitation is somewhat better modelled as an impulse train that drops off at 12 dB per octave [Flanagan 1972]. This trend can be removed by spectral pre-emphasis.

Second, the sampling is usually significantly aliased, which is a more serious problem. In this case, we can recover only a low-pass version of the transfer function. A rectangular window is a poor choice in this case, since its transform rings for a considerable duration in the frequency domain. The gaussian is a good choice, because it has minimal bandwidth for a given window duration, as indicated in the previous chapter (see Figure 3.2). Typically, the standard deviation of the gaussian window is selected about equal to the pitch period.

† The analysis in terms of log spectra and cepstra does not require this proviso, since convolutions in the time domain transform (exactly) to sums in the cepstral domain. This is an advantage of the cepstral formulation.

3.2. Non-stationary vocal tract

Let us now consider the case where the vocal tract configuration is not necessarily static. The goal is to recover the 'time-varying transfer function' of the vocal tract from the signal and remove the excitation, as we did in the stationary case.

Unfortunately, there is no widely accepted, satisfactory definition of the transfer function for a time-varying linear filter, although there have been many proposals [e.g., see Lui 1971; Loynes 1968; Page 1952; Saleh & Subotic 1985; Zadeh 1950]. We shall avoid this difficulty by constraining the form of the transfer function; we shall allow non-stationarity, but only in certain well-behaved ways.

Figure 3.2. Estimating 'aliased' transfer function. (a) Spectrum of excitation modelled as an impulse train (10 msec period). (b) Square magnitude of the transfer function, a single pole of 150 Hz bandwidth. This has higher 'quefrencies' than the previous example; '(a)' undersamples it in this case. (c) Power spectrum, the product of '(a)' and '(b)'. (d) Magnitude of the autocorrelation function, the (inverse) Fourier transform of '(c)'. Dotted lines show the gaussian window. (e) Fourier transform of the windowed autocorrelation function, which recovers a low-pass version of the transfer function '(b)'.

The vocal tract, of course, is not an arbitrary time-varying filter; it is constrained by the physical properties of the articulators. Josha [1982, 1984] has investigated the physics of the non-stationary vocal tract analytically, and found that under certain reasonable physical assumptions it is possible to generalize the notion of a formant to the time-varying case. Essentially, he replaces the assumption of a static vocal tract configuration by the assumption that the deformations are slow enough to satisfy the condition of adiabatic approximation, which he indicates appears to be generally valid from cine X-ray measurements.

We can thus define the impulse response, $h(t,a)$, for a time-varying 'resonance' of the vocal tract to an impulse, $\delta(t-a)$, at time $a$ as:

$$h(t,a) = e^{(\sigma_0 + i\omega_0)(t-a) + i\int_a^t \gamma(\tau)\,d\tau}\,u(t-a), \qquad (3.2.1)$$

where we assume the formant bandwidth $B_0$ is fixed, and the formant center frequency is $\omega_0$ at $t = 0$. Note that Eq. 3.2.1 reduces to the usual definition of the impulse response of a formant if the time-varying modulation frequency, $\gamma(t)$, is zero.

In Josha's model, the bandwidth varies somewhat with rate of change of vocal tract area, which we shall treat as negligible. Regarding these bandwidth variations Fant [L9S0] believes they "...are of academic rather practical significance. Of greater importance is probably the mere fact that a rapid transition of a formant creates a special perceptual 'chirp' effect."

It will be convenient to examine a more general class of impulse responses than in Eq. 3.2.1. Consider the impulse response

$$h(t,a) = h_0(t-a)\,e^{i\int_a^t \gamma(\tau)\,d\tau}, \qquad (3.2.2)$$

where $h_0(t)$ is the impulse response of a linear time-invariant (LTI) system and $\gamma(0) = 0$. Eq. 3.2.1 has this form with $h_0(t) = e^{(\sigma_0 + i\omega_0)t}\,u(t)$. We call this a frequency-modulated filter. We shall study this kind of filter in the next several sections, since it is possible to generalize the notion of a transfer function for it and it is possible to estimate this transfer function by generalizing the 'cepstral' methods described above. Of course, an FM filter models only a single pole; we shall take up the multiple pole model of the complete vocal tract transfer function in a later section.

How then can we represent the time-varying transfer function of an FM filter? An intuitively appealing candidate is

$$H(t,\omega) = H_0[i(\omega - \gamma(t))], \qquad (3.2.3)$$

where $H_0(i\omega)$ is the transfer function of the corresponding stationary filter with impulse response $h_0(t)$ (Eq. 3.2.2). In terms of how we might want to visualize the transfer function of an FM filter, this seems attractive; it is just the stationary transfer function shifted at each time by the local modulation frequency $\gamma(t)$. For a time-varying formant pole, $H(t,\omega)$ would have the form of a stationary pole in each frequency cross-section with center frequency $\omega_0 + \gamma(t)$ and fixed bandwidth $B_0$.

For our purposes, the most important properties that the definition of the time-varying transfer function of a formant should satisfy are practical ones: it should provide phonetically relevant information about the signal, and it should be computable from the signal. The representation in Eq. 3.2.3 satisfies these properties since it is a simple generalization of the stationary case, which is already understood, and it can be estimated from the signal by methods we will describe shortly.
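To make the definition in Eq. 3.2.3 concrete, the sketch below evaluates $|H(t,\omega)|^2$ for a single pole whose modulation frequency sweeps linearly, and reads off the peak of each frequency cross-section. It is illustrative only: the pole parameters, the sweep rate and the function names are hypothetical values chosen for the example.

```python
import numpy as np

def h0_transfer(omega, sigma0, omega0):
    """H_0(i*omega) for a single pole at s_0 = sigma0 + i*omega0."""
    return 1.0 / (1j * omega - (sigma0 + 1j * omega0))

def fm_transfer(t, omega, gamma, sigma0, omega0):
    """H(t, omega) = H_0[i(omega - gamma(t))]: the stationary pole shifted by gamma(t)."""
    return h0_transfer(omega - gamma(t), sigma0, omega0)

def gamma(t):
    """Hypothetical linear sweep of 20 Hz per msec, with gamma(0) = 0."""
    return 2 * np.pi * 20000.0 * t

sigma0 = -2 * np.pi * 50.0          # assumed pole damping in rad/s
omega0 = 2 * np.pi * 1500.0         # assumed centre frequency at t = 0

omega = 2 * np.pi * np.arange(0.0, 4000.0, 1.0)   # 1 Hz frequency grid
peaks_hz = []
for t in (0.0, 0.01, 0.02):
    magnitude = np.abs(fm_transfer(t, omega, gamma, sigma0, omega0))**2
    peaks_hz.append(omega[np.argmax(magnitude)] / (2 * np.pi))
print(peaks_hz)
```

Each cross-section has the shape of the stationary pole, and its peak follows $\omega_0 + \gamma(t)$: 1500, 1700 and 1900 Hz at the three times sampled here.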

The transfer function of an LTI filter, however, also has some nice theoretical properties that would be desirable when generalized to the time-varying case. In particular, the transfer function $H_0(i\omega)$ of an LTI filter, $y(t) = T_0[x(t)]$: (1) specifies the eigenvalues for the filter's eigenfunctions, i.e.,

$$T_0[e^{i\omega t}] = H_0(i\omega)\,e^{i\omega t}, \qquad (3.2.4)$$

and (2) is the ratio of the spectrum of the output over the spectrum of the input,

$$H_0(i\omega) = \frac{Y(\omega)}{X(\omega)}. \qquad (3.2.5)$$

The first property does generalize to the FM case. Consider the functions

$$p_\omega(t) = e^{i\left[\omega t + \int_0^t \gamma(\tau)\,d\tau\right]}. \qquad (3.2.6)$$

These are the eigenfunctions for an FM filter $T$, with impulse response defined by Eq. 3.2.2. This follows from

$$\begin{aligned} T[p_\omega(t)] &= \int_{-\infty}^{\infty} h(t,a)\,p_\omega(a)\,da \\ &= \int_{-\infty}^{\infty} h_0(t-a)\,e^{i\int_a^t \gamma(\tau)\,d\tau}\,e^{i\left[\omega a + \int_0^a \gamma(\tau)\,d\tau\right]}\,da \\ &= e^{i\left[\omega t + \int_0^t \gamma(\tau)\,d\tau\right]}\int_{-\infty}^{\infty} h_0(t-a)\,e^{-i\omega(t-a)}\,da \\ &= H_0(i\omega)\,p_\omega(t). \end{aligned} \qquad (3.2.7)$$

Further, we see from Eq. 3.2.7 that $H_0(i\omega)$ specifies the eigenvalues for the eigenfunctions $p_\omega(t)$. The value of $H_0(i\omega)$, however, depends on the choice of the time origin. More generally,

$$T[p_\omega(t)] = H(0,\omega)\,p_\omega(t) \qquad (3.2.8)$$

is time shift-invariant, where $H(t,\omega)$ is defined by Eq. 3.2.3.†
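The eigenfunction property can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the modulation is taken to be a linear sweep $\gamma(t) = \beta t$, the stationary filter is the single pole $h_0(t) = e^{s_0 t}u(t)$, all quantities are dimensionless, and the superposition integral is approximated by the trapezoidal rule. The parameter values and function names are hypothetical.

```python
import numpy as np

sigma0, omega0 = -1.0, 10.0   # pole of the stationary filter (assumed values)
beta = 0.5                    # sweep rate of the hypothetical modulation gamma(t) = beta*t

def phase(t):
    """Integral of gamma from 0 to t for gamma(t) = beta*t."""
    return 0.5 * beta * t**2

def h(t, a):
    """FM impulse response h(t,a) = h0(t-a) exp(i * int_a^t gamma), zero for a > t."""
    s0 = sigma0 + 1j * omega0
    return np.where(t >= a, np.exp(s0 * (t - a) + 1j * (phase(t) - phase(a))), 0.0)

def p(omega, t):
    """Candidate eigenfunction p_omega(t) = exp(i[omega*t + int_0^t gamma])."""
    return np.exp(1j * (omega * t + phase(t)))

def apply_filter(omega, t, span=25.0, step=1e-3):
    """Trapezoidal approximation of T[p_omega](t) = int h(t,a) p_omega(a) da."""
    delay = step * np.arange(int(round(span / step)) + 1)   # delays t - a >= 0
    a = t - delay
    integrand = h(t, a) * p(omega, a)
    return step * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

omega = 8.0
eigenvalue = 1.0 / (1j * omega - (sigma0 + 1j * omega0))    # H_0(i*omega)
ratios = [apply_filter(omega, t) / p(omega, t) for t in (-1.0, 0.0, 2.0)]
print(eigenvalue, ratios)
```

The ratio of output to input is the same at every time sampled and agrees with $H_0(i\omega)$, which is the content of Eq. 3.2.7 for this choice of $\gamma(t)$.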

By comparison, some authors have used

$$\hat{H}(t,\omega) = \int h(t,a)\,e^{-i\omega(t-a)}\,da \qquad (3.2.9)$$

as their definition of the time-varying transfer function [e.g., Zadeh 1950]. The filter's response to a complex exponential $e^{i\omega t}$ is $\hat{H}(t,\omega)\,e^{i\omega t}$. However, $e^{i\omega t}$ is not, in general, an eigenfunction of a time-varying system; consequently $\hat{H}(t,\omega)$ has limited use.

Saleh & Subotic [1985] have explored generalizing the second property (Eq. 3.2.5) to the time-varying case. They suggest using

$$\frac{F_y(t,\omega)}{F_x(t,\omega)} \qquad (3.2.10)$$

as the definition of the time-varying transfer function, where $F_x(t,\omega)$ and $F_y(t,\omega)$ are joint time-frequency representations of the input and output signals, respectively. The difficulty with their approach is that the ratio in Eq. 3.2.10, in general, will have different values for different inputs $x(t)$ for a given filter, unlike the LTI case (Eq. 3.2.5). This second property evidently does not generalize well to the time-varying case.

3.3. Time-frequency filtering 

The remainder of this chapter is used to show that time-frequency filtering can be used to estimate the transfer function of FM filters and, more generally, of the time-varying vocal tract. Time-frequency filtering consists of multiplying the time-frequency autocorrelation function $A_x(\tau,\nu)$ (Eq. 2-B-S) of the signal $x(t)$ with a 2-D window $\Phi(\tau,\nu)$. The 2-D inverse Fourier transform of this windowed function,

$$F_x(t,\omega) = \mathcal{F}^{-1}_{2\text{-D}}\left[\Phi(\tau,\nu)\,A_x(\tau,\nu)\right], \qquad (3.3.1)$$

becomes the filtered time-frequency representation. The shape of the window, of course, determines what energy is kept and what is removed in the filtered representation [cf. Flandrin 1984].

† I.e., suppose $t' = t - \tau$. Let $H'(t',\omega)$ and $H_0'(i\omega)$ be the time-varying transfer function and the corresponding LTI transfer function, respectively, in the new time co-ordinate. Then, $H'(t',\omega) = [\ldots]$

This technique is in many ways the time-varying generalization of the 'cepstral' methods presented in Section 3.1. The time-frequency autocorrelation takes the place of the autocorrelation function, a 2-D window the place of a 1-D window, and a 2-D inverse Fourier transform the place of a 1-D Fourier transform in this generalization.

The representation in Eq. 3.3.1 also specifies a general member of the quadratic transforms presented in the previous chapter, indicating that the two chapters are related. In this chapter, our goal is to show that a member of this class can give a good estimate of the time-varying 'transfer function' of the vocal tract. Happily, it turns out that the form of time-frequency window $\Phi(\tau,\nu)$ that gives a good estimate is a 2-D gaussian, which is the same as Eq. 2.6.7. In other words, we end up with the same kind of time-frequency representation as in the previous chapter, which was based there on weaker, but more general goals.

The results of this chapter, then, reinforce and reinterpret those of the previous chapter. Further, the analysis here suggests which scales to choose, decisions that were free parameters of Chapter 2. In particular, for voiced speech, $\sigma_t$ is matched to the pitch period, and $\sigma_\omega$ is matched to the fundamental frequency.



We have just given the basic result of this chapter. It remains to demonstrate its validity, i.e., that this kind of filtering will give a good estimate of the time-varying vocal tract 'transfer function'. This requires several steps in which we gradually generalize the form of the filter that models the vocal tract. In Section 3.4, we re-examine the stationary case, this time in terms of the time-frequency autocorrelation function. In Section 3.5, we consider FM filters that have a linearly varying modulation frequency. In Section 3.6, we use a locality argument to generalize these results for quasi-stationary filters and for FM filters that have a smoothly varying modulation frequency, respectively. In Section 3.7, we use a superposition argument to treat the multiple pole case.

3.4. The stationary case, re-examined

So let us assume for now we want to estimate the transfer function of a filter that is time-invariant. We will show how the time-frequency autocorrelation function can be used to produce this estimate.

This will really just be a recapitulation of the stationary argument presented in Section 3.1. In fact, $A_x(\tau, 0) = R_x(\tau)$, so we see the correspondence is very close. But with the time-frequency autocorrelation function we will be in a position to generalize these results to the time-varying case, so it is worth the effort.

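Letting $x(t)$ represent the filter input, $h(t)$ the filter's impulse response, and $y(t)$ the output, we have

$$A_y(\tau,\nu) = \int_{-\infty}^{\infty} A_x(\xi,\nu)\,A_h(\tau - \xi,\nu)\,d\xi. \qquad (3.4.1)$$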

In other words, the time-frequency autocorrelation function $A_y(\tau,\nu)$ consists of the convolution of $A_x(\tau,\nu)$ and $A_h(\tau,\nu)$ along the $\tau$ dimension. This is analogous to Eq. 3.1.6.
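The two properties just stated can be checked numerically. The following is a minimal sketch, not the computation used in this report: it assumes a discrete, finite-length signal and a one-sided lag product $x[n+k]\,x^*[n]$ in place of the symmetric $x(t+\tau/2)\,x^*(t-\tau/2)$ (the two differ only by a phase factor), and the names `tf_autocorr`, `x`, `h` and `n_freq` are hypothetical.

```python
import numpy as np

def tf_autocorr(x, n_freq):
    """Discrete time-frequency autocorrelation of a finite-length signal.

    Returns (lags, A) with
        A[row, q] = sum_n x[n + k] * conj(x[n]) * exp(-2j*pi*q*n/n_freq),
    where k = lags[row]. The one-sided lag product is an assumption made so
    that every term falls on the sample grid; n_freq must be >= len(x).
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    lags = np.arange(-(n - 1), n)
    out = np.zeros((len(lags), n_freq), dtype=complex)
    for row, k in enumerate(lags):
        prod = np.zeros(n, dtype=complex)
        if k >= 0:
            prod[: n - k] = x[k:] * np.conj(x[: n - k])
        else:
            prod[-k:] = x[: n + k] * np.conj(x[-k:])
        out[row] = np.fft.fft(prod, n=n_freq)   # transform over the time index n
    return lags, out

rng = np.random.default_rng(0)
x = rng.standard_normal(16)            # hypothetical filter input
h = rng.standard_normal(8)             # hypothetical impulse response
y = np.convolve(x, h)                  # output of the time-invariant filter

n_freq = 32
_, A_x = tf_autocorr(x, n_freq)
_, A_h = tf_autocorr(h, n_freq)
_, A_y = tf_autocorr(y, n_freq)

# The zero-frequency slice is the ordinary autocorrelation of x.
zero_slice_ok = np.allclose(A_x[:, 0], np.correlate(x, x, mode="full"))

# At each frequency, A_y is the convolution of A_x and A_h along the lag axis.
conv = np.stack([np.convolve(A_x[:, q], A_h[:, q]) for q in range(n_freq)], axis=1)
convolution_ok = np.allclose(A_y, conv)
```

Both checks hold for any finite-length `x` and `h`, since the convolution property follows from the product form of the output spectrum and does not depend on the particular signals chosen here.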

Let the filter input be an impulse train $I(t; T) = \sum_n \delta(t - nT)$. Then

$$A_I(\tau,\nu) = \int_{-\infty}^{\infty} e^{-i\nu t} \sum_n \delta(t - nT + \tau/2) \sum_m \delta^*(t - mT - \tau/2)\,dt.$$

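Substituting $t' = t - \tfrac{1}{2}(m + n)T$ and $\tau' = \tau + (m - n)T$,

$$A_I(\tau,\nu) = \sum_n \sum_m e^{-i\nu(m+n)T/2} \left\{ \int_{-\infty}^{\infty} e^{-i\nu t'}\,\delta(t' + \tau'/2)\,\delta^*(t' - \tau'/2)\,dt' \right\}.$$

The quantity in braces is the time-frequency autocorrelation function of an impulse $\delta(t')$, which is $A_\delta(\tau',\nu) = \delta(\tau')$ [see Claasen & Mecklenbräuker 1980a]. Thus,

$$A_I(\tau,\nu) = \sum_n \sum_m e^{-i\nu(m+n)T/2}\,\delta(\tau + (m - n)T).$$

Letting $k = n - m$,

$$A_I(\tau,\nu) = \sum_k \delta(\tau - kT)\,e^{-i\nu kT/2} \left\{ \sum_m e^{-i\nu mT} \right\}.$$

The quantity in braces is the Fourier transform of an impulse train $I(t; T)$, which is itself an impulse train $\frac{2\pi}{T} I(\nu; 2\pi/T)$ [see Bracewell 1978]. Therefore,

$$A_I(\tau,\nu) = \frac{2\pi}{T} \sum_k \sum_l (-1)^{kl}\,\delta(\tau - kT)\,\delta(\nu - 2\pi l/T). \qquad (3.4.2)$$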

Eq. 3.4.2 shows that the time-frequency autocorrelation function of an impulse train is a rectangular grid of impulses spaced $T$ apart along $\tau$ and $2\pi/T$ apart along $\nu$ (see Figure 3.3).† Eq. 3.4.2 is the two-dimensional analog of Eq. 3.1.7.
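The grid structure can be illustrated with a short numerical sketch. It is only an illustration under stated assumptions: the signal is discrete and treated as circular (periodic) over `N` samples, the lag product is one-sided, and only magnitudes are examined. The names `circular_tf_autocorr`, `N` and `P` are hypothetical.

```python
import numpy as np

def circular_tf_autocorr(x):
    """A[k, q] = sum_n x[(n + k) mod N] * conj(x[n]) * exp(-2j*pi*q*n/N)."""
    x = np.asarray(x, dtype=complex)
    return np.array([np.fft.fft(np.roll(x, -k) * np.conj(x)) for k in range(len(x))])

N, P = 64, 8                     # hypothetical record length and period, in samples
train = np.zeros(N)
train[::P] = 1.0                 # impulse train with period P

A = np.abs(circular_tf_autocorr(train))
lag_idx, freq_idx = np.nonzero(A > 1e-9)

# Nonzero values occur only at lags that are multiples of P and at
# frequencies that are multiples of N / P: a rectangular grid of impulses.
lags_on_grid = bool(np.all(lag_idx % P == 0))
freqs_on_grid = bool(np.all(freq_idx % (N // P) == 0))
```

In this discrete setting the grid spacing of `N // P` frequency bins plays the role of $2\pi/T$, and the spacing of `P` lag samples plays the role of $T$.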



† Siebert [1956] has derived the time-frequency autocorrelation function for a train of pulses of arbitrary shape, a result that is important in the theory of radar. The above result follows formally from this if the pulses are given unit area and approach zero width in the limit.



Figure 3.3. Magnitude of the time-frequency autocorrelation function of an impulse train (10 msec period). [Axes: Hz (vertical), sec (horizontal).]



Figure 3.4. Magnitude of the time-frequency autocorrelation function of the output of an LTI filter excited by an impulse train. In this simple example the filter consists of a single pole of 300 Hz bandwidth. [Axes: Hz (vertical), sec (horizontal).]




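From Eq. 3.4.1, we have

$$A_y(\tau,\nu) = \frac{2\pi}{T} \sum_k \sum_l (-1)^{kl}\,A_h(\tau - kT,\nu)\,\delta(\nu - 2\pi l/T), \qquad (3.4.3)$$

the two-dimensional analog of Eq. 3.1.8. $A_y(\tau,\nu)$ consists of a rectangular grid of shifted $\tau$ slices of $A_h(\tau,\nu)$ (see Figure 3.4).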

Provided the terms in Eq. 3.4.3 do not overlap, $A_h(\tau, 0)$ can be recovered from $A_y(\tau,\nu)$ by windowing it with a rectangular window that is centered on the origin and that has length $T$, width $2\pi/T$, and height $T/2\pi$ (see Figure 3.5). From $A_h(\tau,0)\,\delta(\nu)$ we can, in turn, recover $|H(i\omega)|^2$, since

$$\mathcal{F}^{-1}[A_h(\tau,0)\,\delta(\nu)] = |H(i\omega)|^2. \qquad (3.4.4)$$



On the other hand, if the terms in Eq. 3.4.3 do overlap somewhat, then a low-pass version of $|H(i\omega)|^2$ can still be recovered, since

$$\mathcal{F}^{-1}[\phi(\tau,\nu)\,A_y(\tau,\nu)] \approx \Phi(t,\omega) \ast\ast\, |H(i\omega)|^2, \qquad (3.4.5)$$

where $\phi(\tau,\nu)$ is the time-frequency window, and $\Phi(t,\omega)$ is its two-dimensional inverse Fourier transform. In this case, using a rectangular window on the time-frequency autocorrelation function is a poor choice, since its transform rings for a considerable duration away from the origin. A gaussian window minimizes this problem.
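The ringing argument can be made concrete with a one-dimensional sketch. It compares how much of each window's transform energy lies far from the origin; the window lengths, the gaussian width and the cutoff `exclude` are hypothetical choices made only for illustration.

```python
import numpy as np

N = 1024
n = np.arange(N) - N // 2
half_width = 32                                        # hypothetical half-length, in samples
rect = (np.abs(n) <= half_width).astype(float)         # rectangular window
gauss = np.exp(-0.5 * (n / (half_width / 2.0)) ** 2)   # gaussian of comparable extent

def far_energy_fraction(window, exclude=64):
    """Fraction of transform energy more than `exclude` bins from the origin."""
    spec = np.abs(np.fft.fftshift(np.fft.fft(np.fft.ifftshift(window)))) ** 2
    centre = len(spec) // 2
    far = np.abs(np.arange(len(spec)) - centre) > exclude
    return spec[far].sum() / spec.sum()

rect_leak = far_energy_fraction(rect)     # sidelobes of the rectangular window
gauss_leak = far_energy_fraction(gauss)   # essentially none for the gaussian
```

With these settings the rectangular window leaves a few percent of its transform energy beyond the cutoff, while the gaussian leaves a negligible amount, which is the sense in which the gaussian window "minimizes this problem".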



Figure 3.5. Rectangular window (very nearly) recovers 'unaliased' transfer function. (a) Windowed time-frequency autocorrelation function in Figure 3.4. (b) Square magnitude of transfer function, the 2-D inverse Fourier transform of (a). In the 'aliased' case, i.e., if the terms in Figure 3.4 were to overlap significantly, a gaussian window would be more appropriate.




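Let us examine the form of $A_h(\tau,\nu)$ assuming for now that the filter consists of only a single pole, i.e., its impulse response has the form of Eq. 3.1.5. Then, omitting the constant amplitude factor of Eq. 3.1.5 and writing $\sigma_n$ and $\omega_n$ for the decay rate and frequency of the pole,

$$A_h(\tau,\nu) \propto e^{i\omega_n\tau} \int_{-\infty}^{\infty} e^{-2\sigma_n t}\,u(t - |\tau|/2)\,e^{-i\nu t}\,dt = \frac{1}{2\sigma_n + i\nu}\,e^{-(\sigma_n + i\nu/2)|\tau|}\,e^{i\omega_n\tau}. \qquad (3.4.6)$$

This last equation is the two-dimensional analog of Eq. 3.1.9.

Thus, provided the pole bandwidth is large enough, windowing $A_y(\tau,\nu)$ can recover most of $A_h(\tau,\nu)$, and, hence, a low-pass version of $|H(i\omega)|^2$.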

3.5. Linearly varying modulation frequency 

We now consider the case where we want to estimate the transfer function of an FM filter that has a linearly varying modulation frequency, i.e., a modulation frequency of $mt$ in Eq. 3.2.2. This means

$$h(t,a) = h_0(t - a)\,e^{i\frac{1}{2}m(t^2 - a^2)}. \qquad (3.5.1)$$

The previous section was the special case $m = 0$.

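Let us find how passing a signal through such a filter modifies its time-frequency autocorrelation function. As usual, we let $x(t)$ represent the input to the filter and $y(t)$ the output. Thus,

$$y(t) = \int_{-\infty}^{\infty} x(a)\,h(t,a)\,da = e^{i\frac{1}{2}mt^2} \int_{-\infty}^{\infty} x(a)\,e^{-i\frac{1}{2}ma^2}\,h_0(t - a)\,da. \qquad (3.5.2)$$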




Letting $\hat{x}(t) = x(t)\,e^{-i\frac{1}{2}mt^2}$ and $\hat{y}(t) = y(t)\,e^{-i\frac{1}{2}mt^2}$, we have from Eq. 3.5.2 and Eq. 3.4.1,

$$A_{\hat{y}}(\tau,\nu) = \int_{-\infty}^{\infty} A_{\hat{x}}(\xi,\nu)\,A_{h_0}(\tau - \xi,\nu)\,d\xi. \qquad (3.5.3)$$

In other words, the time-frequency autocorrelation function of $\hat{y}(t)$ consists of the convolution of the time-frequency autocorrelation functions of $\hat{x}(t)$ and $h_0(t)$ along the $\tau$ dimension.

We are more directly interested in $A_x$ and $A_y$ than $A_{\hat{x}}$ and $A_{\hat{y}}$. But this last transformation is simple, since the time-frequency autocorrelation function has the following nice property: if $\hat{x}(t) = x(t)\,e^{-i\frac{1}{2}mt^2}$, then [Van Trees 1971]

$$A_{\hat{x}}(\tau,\nu) = A_x(\tau,\nu + m\tau). \qquad (3.5.4)$$

In other words, multiplying a signal by a linear chirp shears its time-frequency autocorrelation function along the $\nu$ dimension (see Figure 3.6).
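The shear property is easy to verify numerically. The sketch below is an illustration under stated assumptions rather than the report's own computation: the signal is discrete and circular over `N` samples, the lag product is one-sided (which adds a phase factor, so only magnitudes are compared), and the chirp rate `c` is chosen as an even integer number of DFT bins per lag sample so that the chirp is periodic in `N`. All names are hypothetical.

```python
import numpy as np

def circular_tf_autocorr(x):
    """A[k, q] = sum_n x[(n + k) mod N] * conj(x[n]) * exp(-2j*pi*q*n/N)."""
    x = np.asarray(x, dtype=complex)
    return np.array([np.fft.fft(np.roll(x, -k) * np.conj(x)) for k in range(len(x))])

N = 32
c = 2                                   # shear, in DFT bins per sample of lag (even integer)
m = 2 * np.pi * c / N                   # corresponding chirp slope
n = np.arange(N)

rng = np.random.default_rng(1)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
x_hat = x * np.exp(-0.5j * m * n**2)    # multiply the signal by a linear chirp

A = np.abs(circular_tf_autocorr(x))
A_hat = np.abs(circular_tf_autocorr(x_hat))

# Row k of |A_hat| equals row k of |A| read at frequency index q + c*k (mod N),
# i.e., the surface is sheared along the frequency axis in proportion to the lag.
sheared = np.array([np.roll(A[k], -c * k) for k in range(N)])
shear_ok = np.allclose(A_hat, sheared)
```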

Combining Eq. 3.5.3 and Eq. 3.5.4, we see that

$$A_y(\tau,\nu) = \int_{-\infty}^{\infty} A_x(\xi,\nu + m[\xi - \tau])\,A_{h_0}(\tau - \xi,\nu - m\tau)\,d\xi. \qquad (3.5.5)$$

In words, the time-frequency autocorrelation function of a signal passed through the filter in Eq. 3.5.1 can be found by first shearing its input time-frequency autocorrelation function, convolving that with the time-frequency autocorrelation function of $h_0(t)$, and then shearing the output time-frequency autocorrelation function in the opposite direction, all with respect to the $\nu$ dimension (see Figure 3.7).

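When the filter input is an impulse train $I(t; T)$, the time-frequency autocorrelation function of the filter output is

$$A_y(\tau,\nu) = \frac{2\pi}{T} \sum_k \sum_l (-1)^{kl}\,A_{h_0}(\tau - kT,\nu - m\tau)\,\delta(\nu - m[\tau - kT] - 2\pi l/T). \qquad (3.5.6)$$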



Figure 3.6. Multiplying a signal $x(t)$ by $e^{-i\frac{1}{2}mt^2}$ shears its time-frequency autocorrelation function: $A_x(\tau,\nu + m\tau)$.



In other words, $A_y(\tau,\nu)$ consists of a rectangular grid of shifted $\tau$ slices of $A_{h_0}(\tau,\nu)$ that have been sheared in the $\nu$ direction by slope $m$ (see Figure 3.8).

If these terms do not overlap, then we can window $A_y(\tau,\nu)$ about the origin and recover the single term $A_{h_0}(\tau,0)\,\delta(\nu - m\tau)$. We can then take its inverse 2-D Fourier transform to obtain $|H(t,\omega)|^2$ (see Figure 3.9), since, from the definition of $H(t,\omega)$,

$$\mathcal{F}^{-1}[A_{h_0}(\tau,0)\,\delta(\nu - m\tau)] = |H_0(i(\omega - mt))|^2 = |H(t,\omega)|^2. \qquad (3.5.7)$$



On the other hand, if the terms in Eq. 3.5.6 do overlap somewhat, then a low-pass version of $|H(t,\omega)|^2$ can still be recovered, since

$$\mathcal{F}^{-1}[\phi(\tau,\nu)\,A_y(\tau,\nu)] \approx \Phi(t,\omega) \ast\ast\, |H_0(i(\omega - mt))|^2, \qquad (3.5.8)$$

where $\phi(\tau,\nu)$ is the time-frequency window, and $\Phi(t,\omega)$ is its inverse Fourier transform. A 2-D gaussian window is used, and its dimensions are matched to the period $T$ and the fundamental frequency $2\pi/T$, respectively (see Figure 3.10).

Figure 3.7. Obtaining the time-frequency autocorrelation function, $A_y(\tau,\nu)$, of a signal $x(t)$ passed through the filter in Eq. 3.5.1. [Flow diagram: $x(t)$; $A_x(\tau,\nu) = \int e^{-i\nu t}\,x(t + \tau/2)\,x^*(t - \tau/2)\,dt$; shear, $A_{\hat{x}}(\tau,\nu) = A_x(\tau,\nu + m\tau)$; convolve; shear, $A_y(\tau,\nu) = A_{\hat{y}}(\tau,\nu - m\tau)$.]

Figure 3.8. Magnitude of the time-frequency autocorrelation function of the output of an FM filter with linearly varying modulation (slope 10 Hz/msec) excited by an impulse train (10 msec period). In this example, the corresponding LTI filter consists of a single pole of 300 Hz bandwidth. [Axes: Hz (vertical), sec (horizontal).]



Figure 3.9. Rectangular window (very nearly) recovers 'unaliased' transfer function. (a) Windowed time-frequency autocorrelation function in Figure 3.8. (b) Square magnitude of transfer function, the 2-D inverse Fourier transform of (a). In the 'aliased' case, i.e., if the terms in Figure 3.8 were to overlap, a gaussian window would be more appropriate.




So far, we have shown that time-frequency filtering can be used to estimate the transfer function of two kinds of linear filters: time-invariant filters and FM filters with linearly varying modulation frequency. We now show that more general cases will follow from the time locality of this operation.

3.6. The quasi-stationary case

We next consider the quasi-stationary case in which the vocal tract changes slowly over time. The traditional way to deal with this situation is to extend the stationary arguments (Section 3.1) by substituting the short-time spectrum for the spectrum of the entire signal. There are thus two windows involved in this analysis: the spectrogram window, $w_s(t)$, and the autocorrelation function window, $w_A(\tau)$.

The "two-dimensional" approach that we have outlined above extends directly without the need of an additional window. In fact, the estimate of $|H(t,\omega)|^2$ is a positive representation of the signal energy, so from Eq. 2.6.5 we know that it effectively depends only on signal values within a few $\sigma_t$ of $t$.† Provided the quasi-stationary signal does not change much over this interval, the stationary results of Section 3.4 generalize immediately.

These two approaches for quasi-stationary signals, the former using a 1-D window, $w_s(t)$, on the signal and a 1-D window, $w_A(\tau)$, on the autocorrelation function, and the latter using a single 2-D window, $\phi(\tau,\nu)$, on the time-frequency autocorrelation function, are related. In fact, $\phi(\tau,\nu) = A_{w_s}(\tau,\nu)\,w_A(\tau)$. The latter approach specifies the time and frequency scale of interest independently with each of the dimensions of the window $\phi(\tau,\nu)$. This is somewhat cleaner than the former, which



† Provided $\sigma_t\sigma_\omega \geq 1/2$.




selects the time and frequency scales with its two windows, $w_s(t)$ and $w_A(\tau)$, but not independently.

3.7. Smoothly varying modulation frequency

Suppose the modulation frequency in Eq. 3.2.2 varies smoothly as a function of time. In other words, it is approximately linear locally, with a small second derivative. For example, a formant with a trajectory that does not have sharp bends in it can be modelled this way. By comparison, quasi-stationarity requires the trajectory to have shallow slope, i.e., a small first derivative.

The locality argument used in the preceding section to show that the estimate of $|H(t,\omega)|^2$ extends to the quasi-stationary case applies equally to the case here. If the modulation slope does not change much over an interval of a few $\sigma_t$, then the results of Section 3.5 on filters with a linearly varying modulation frequency generalize immediately to the smoothly varying case. This is because the estimate of $|H(t,\omega)|^2$ depends only locally on the signal.

3.8. The vocal tract transfer function

Thus far, we have defined the notion of a frequency modulated filter and its time-varying transfer function, and we have shown how to estimate this transfer function from the output signal, provided the modulation slope varies sufficiently slowly. We did this because we modelled each formant pole as an FM filter. The vocal tract is modelled as a weighted sum of formant poles, i.e., its impulse response is

$$h(t,a) = \sum_{n=1}^{N} h_n(t,a), \qquad (3.8.1)$$

where $h_n(t,a)$ is the impulse response of each pole, Eq. 3.2.1 (cf. Eq. 3.1.4).






How can we define the transfer function of such a filter? Extending the stationary case (Eq. 3.1.1) would suggest the definition given in Eq. 3.8.2.

There are two advantages of this definition. First, it is a simple generalization of the stationary case; it allows us to think of the transfer function of the time-varying vocal tract at a given time $t$ as equivalent to the transfer function of a stationary vocal tract for the current articulatory configuration. Second, we shall show that it can be estimated from the speech signal, in fact by the methods we have already presented. These two conditions, which we can call abstractly phonetic relevance and computability, are probably the most important for any representation to satisfy in the analysis of speech. Unfortunately, there is no simple relation between the system's eigenvalues or the time-frequency representations of the input and output signals and this definition of time-varying 'transfer function'. These latter notions just do not generalize well to this time-varying case.

Two facts show that the transfer function in Eq. 3.8.2 can be estimated by the time-frequency filtering technique we have described above. The first specifies the effect of variable gain at the filter output on the transfer function estimate, which is given by Eq. 3.9.4 in the next section. The second specifies the effect of adding the output of two filters together on the transfer function estimate. Suppose that $h(t,a) = h_1(t,a) + h_2(t,a)$ and that $|H_1(t,\omega)|\,|H_2(t,\omega)| = 0$. Then $|H(t,\omega)|^2 = |H_1(t,\omega)|^2 + |H_2(t,\omega)|^2$. In other words, superposition holds provided the transfer functions do not overlap. This last condition means that we must consider only regions where the formants are not too close to each other, as we did in the stationary argument in Section 3.1 (cf. Eq. 3.1.11).† This relation holds not only for the transfer

† Of course, formants often come close together, but we ignore such time-frequency regions for simplicity in this argument. A more thorough treatment would try to deal with these regions also.




functions involved, but also for the estimates of the transfer functions given by the time-frequency filtering, since they are positive representations of the signal.

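Using these two facts, we have

$$\mathcal{F}^{-1}[\phi(\tau,\nu)\,A_y(\tau,\nu)] \approx \sum_{n=1}^{N} \Phi(t,\omega) \ast\ast\, |H_n(t,\omega)|^2 = \Phi(t,\omega) \ast\ast\, |H(t,\omega)|^2 \qquad (3.8.3)$$

for the filter in Eq. 3.8.1, as desired.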

3.9. The transmission channel 

It is convenient at this point to consider the effect of the transmission channel characteristics on the estimate of the transfer function $|H(t,\omega)|^2$. The results will prove useful in the next section. We examine two cases: the transmission channel as an LTI system with impulse response $r(t)$, and the transmission channel having variable gain $s(t)$.

There are two facts about the Wigner distribution that we need [Claasen & Mecklenbräuker 1980a]. If $p(t) = r(t) * y(t)$, then

$$W_p(t,\omega) = \int_{-\infty}^{\infty} W_r(a,\omega)\,W_y(t - a,\omega)\,da, \qquad (3.9.1)$$

and if $q(t) = s(t)\,y(t)$, then

$$W_q(t,\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} W_s(t,a)\,W_y(t,\omega - a)\,da. \qquad (3.9.2)$$

In other words, in the first case the Wigner distributions are convolved in time, and in the second case they are convolved in frequency.




If the spectral shaping of the first transmission channel is gradual, i.e., $r(t)$ is of short duration, then from Eq. 3.9.1, $W_p(t,\omega) \approx |R(i\omega)|^2\,W_y(t,\omega)$. If the gain variations of the second transmission channel are slow, then from Eq. 3.9.2, $W_q(t,\omega) \approx |s(t)|^2\,W_y(t,\omega)$. It follows from these equations and Eq. 3.5.8 that

$$\mathcal{F}^{-1}[\phi(\tau,\nu)\,A_p(\tau,\nu)] \approx \Phi(t,\omega) \ast\ast\, |R(i\omega)\,H(t,\omega)|^2, \qquad (3.9.3)$$

and

$$\mathcal{F}^{-1}[\phi(\tau,\nu)\,A_q(\tau,\nu)] \approx \Phi(t,\omega) \ast\ast\, |s(t)\,H(t,\omega)|^2. \qquad (3.9.4)$$

Thus, these simple kinds of transmission channels have simple effects on the transfer function estimate. The broadband LTI channel essentially shapes the estimate's frequency slices and the slowly varying gain channel shapes its time slices.

3.10. The excitation 

Up to now, we have assumed the filter excitation has been an impulse train. We consider more general (and realistic) forms of excitation in this section.

We can create a general periodic excitation from an impulse train by passing it through an LTI filter whose impulse response $r(t)$ has the excitation's pulse shape. The output can then be passed through the time-varying filter $h(t,a)$. Provided the spectral shaping by $r(t)$ is gradual, i.e., $r(t)$ is of short duration, these two filtering operations will commute. The assumption is that the time-varying filter can be considered quasi-stationary over the duration of $r(t)$. This is a reasonable assumption for the gradual spectral rolloffs produced in speech excitation. Since these two operations commute under these circumstances, the effect of the filter $r(t)$ on the transfer function estimate is given by Eq. 3.9.3.

Similarly, slowly varying changes in the amplitude $s(t)$ of the excitation will result in corresponding changes in the amplitude of the filter output, with the effect on the transfer function estimate given by Eq. 3.9.4. The pitch period need not be constant either. Using the locality arguments again, we only require that the pitch period changes slowly.

Finally, consider the case where the filter is noise-excited. Martin & Flandrin [1985] discuss using time-frequency filtering as a general approach for analysing non-stationary random signals. Our model here involves not only non-stationarity, but also noise that is not additive, and a careful theoretical analysis of this case has not been attempted yet. We must be content, for now, with the following comment. We have seen in the previous chapter that these methods can be used to select time and frequency scales that remove the fine structure introduced by the excitation. This, of course, remains true for this case.



Chapter 4. 

The Schematic Spectrogram 

4.1. Rationale

In the previous chapters we have seen how to obtain a well-behaved representation of the speech energy, with a choice of the time and frequency scales of interest. For the next step we are faced with a methodological decision. If we are willing to make strong assumptions about the signal early on, then we can use those constraints in some detection scheme. For example, one can assume the speech spectrum is composed of a number of poles, and use analysis-by-synthesis or linear predictive coding methods to fit these poles to the spectrum in a formant analysis.

In this approach, a synthetic multiple pole spectrum is fit to each short-time spectrum. Typically, the pole frequencies can be varied, but for tractability the number of poles and their bandwidths are held fixed. Stevens & House [1955] and Olive [1971], for example, computed the mean-square difference between log-magnitude short-time speech spectra and a function of the form:
jv 



le 



n^ i 

±1. tit.i — jt..\\it,i — a.. *1 



-\-k y fla-ff+IW™, (4.1.1) 



The poles of the synthetic spectrum that is found to have the least RMS error are taken to be the formants. The permissible range for each of the poles is often restricted to the typical ranges for the corresponding formants in this method. Different versions of this method are identified by the search strategy used to find the best match. Some have used exhaustive search [Stevens & House 1955; Bell, et al. 1961; Matthews, et al. 1961], so-called analysis-by-synthesis. Olive [1971] used hill-climbing techniques. Linear-predictive coding can be viewed as fitting a fixed number of poles to short-time spectra, using a slightly different spectral distance measure than RMS distance [Atal 1971; Markel & Gray 1976]. The great advantage of LPC is that it provides a simple closed-form solution to the search for an optimum fit.
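The closed-form character of LPC can be seen in a few lines. The sketch below is a minimal illustration of the autocorrelation method, solving the normal equations directly rather than with the Levinson recursion that is usually used; the test signal (the impulse response of a single resonance) and all names are hypothetical and are not taken from the analyses in this report.

```python
import numpy as np

def lpc_autocorrelation(signal, order):
    """Predictor coefficients a[0..order-1] with s[n] ~ sum_k a[k] * s[n-1-k]."""
    s = np.asarray(signal, dtype=float)
    # Autocorrelation values r[0..order] of the (already windowed) signal.
    r = np.array([np.dot(s[: len(s) - k], s[k:]) for k in range(order + 1)])
    # Normal equations: sum_k a[k] * r[|i - k|] = r[i + 1], a Toeplitz system.
    normal_matrix = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(normal_matrix, r[1:])

# Hypothetical test signal: impulse response of one pole pair.
radius, angle = 0.9, np.pi / 4
true_a = np.array([2 * radius * np.cos(angle), -radius**2])
s = np.zeros(400)
s[0] = 1.0
s[1] = true_a[0] * s[0]
for i in range(2, len(s)):
    s[i] = true_a[0] * s[i - 1] + true_a[1] * s[i - 2]

a = lpc_autocorrelation(s, order=2)
poles = np.roots(np.concatenate(([1.0], -a)))     # roots of z^2 - a[0] z - a[1]
pole_angles = np.sort(np.abs(np.angle(poles)))    # pole frequencies in radians/sample
```

For this idealized all-pole signal the fitted poles coincide with the true ones; the difficulties discussed in the text arise when the signal departs from the assumed model.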

One problem with this approach, as stated, is that it depends on the quasi-stationary assumption. The short-time spectral contribution of a formant in rapid motion is poorly modelled as a pole with a bandwidth appropriate for a stationary formant. Even when the bandwidths are variable, as in the LPC technique, the diffuse spectral contribution of the moving formant can cause incorrect formant matches. In principle, these methods can be generalized to the time-varying case. Liporace [1975], in fact, has done so for the LPC technique.

This approach, however, suffers from a more general problem. The model used to generate the synthetic spectra has little notion of the source or transmission channel characteristics, or of nasalization. These effects can contribute significantly to the speech spectrum, "competing" for poles that were meant to be fit to the formants, and thus often resulting in pole distributions that have poor correspondence to the formant distribution. The degree of the fit to a particular point in the spectrum depends on the entire pole distribution, i.e., on the number of poles used and where each pole is positioned in the spectrum. Thus, errors in one part of the spectrum are propagated to other parts in the very first stage of the analysis.

For example, Figure 4.1 shows pole locations found by LPC analysis using the autocorrelation method. The order of the analysis was chosen, as is customary, to allow for two complex poles per 1000 Hz plus 4 poles for matching the overall spectral balance (e.g., 12 pole analysis for 4 kHz filtered speech). A Hamming window of 25 msec duration was used, also a typical choice. In Figure 4.1a, we see that this analysis can perform poorly in regions of rapid formant motion. In Figure 4.1b,c, it appears that the addition of a nasal resonance in the neighborhood of F1 resulted in spurious, unstable behavior in the neighborhood of F3. Decreasing the duration of the window sometimes gives better performance in non-stationary situations, but increases the overall instability of the solution.

The problem, in general, with making such strong assumptions early on in the analysis is that they are seldom universally true. The excitation, the nasal tract, and the transmission channel (e.g., room acoustics and noise) all conspire to make formant analysis more difficult than just fitting poles to a spectrum.

The approach we take here is more conservative, influenced by a similar methodology applied to vision by Marr [1982]. He suggested (1) the principle of least commitment: make no decisions that may have to be 'taken back' later in the analysis, and (2) the principle of explicit naming: produce as rich and useful a symbolic description of the input signal as possible, but without any early commitment to its physical origin. This description can then be further organized and analyzed with the goal of finding its physical correlates.

Applying these guidelines to speech suggests taking the energy representations as in Figure 2.15, and producing rich, symbolic descriptions of the significant features there. There are several features (at various scales) that suggest themselves: time discontinuities (up and down edges), useful for finding onsets, offsets and bursts; time-frequency ridges, easily seen in Figure 2.15, useful for finding the formants and perhaps channel resonances; and some form of gross spectral balance measure, also useful for formant and channel analysis. We call this composite symbolic representation the schematic spectrogram.

Figure 4.1. Examples of problems with 'pole-fitting' approach. (a) Pole locations for utterance /wioi/ of Section 2.9. Note the poor performance in the regions of rapid F2 motion. (b) Spectrogram of /z/ in the context /en/. (c) Pole locations for this nasalized vowel. Note the spurious behavior in the neighborhood of F3.

4.2. Spectral Peaks

To create this representation, we must come up with computations that identify these features. This is not as easy as it may seem, since the features clearly visible in Figure 2.15 may nevertheless require some non-trivial computations to detect reliably. We focus on how to find the time-frequency ridges, due primarily to the formants, in the next sections.

An obvious way to try to find these ridges is to identify peaks in vertical slices of the time-frequency energy surfaces. This approach has been tried by several authors, with the main difference between the various instances being how the smoothing was accomplished. Flanagan [1956] used a filter bank whose output was low-pass filtered, Schafer & Rabiner used cepstral smoothing [Oppenheim 1969; Oppenheim & Schafer 1975], while McCandless [1974] used LPC-based smoothing [Atal 1971; Markel & Gray 1976].

To examine this technique, we will use the smoothed time-frequency surfaces of Chapters 2 and 3. Since these surfaces are smooth, the spectral peaks can be found by looking for maxima, i.e., (negative) zero-crossings in $\partial F(t,\omega)/\partial\omega$. Figure 4.2 shows these points for the time-frequency energy surface in Figure 2.15. While the horizontal ridge due to F1 is well captured, the steeply rising F2 is very poorly captured. This may seem surprising at first, but the reason is simple.

Figure 4.2. Peaks in spectral cross-sections of the time-frequency energy surface in Figure 2.15. The energy ridge due to F2 is poorly captured by this peak computation.
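The peak computation on vertical slices can be sketched as follows. This is an illustrative sketch only: it assumes the surface is sampled as a 2-D array with time along axis 0 and frequency along axis 1, approximates the frequency derivative by a first difference, and uses a synthetic horizontal ridge as a stand-in for real data. The name `slice_peaks` is hypothetical.

```python
import numpy as np

def slice_peaks(surface):
    """Mark local maxima along the frequency axis (axis 1) of each time slice.

    A peak is a negative-going zero-crossing of the first difference in
    frequency: the difference is positive just below the sample and
    non-positive just above it.
    """
    d = np.diff(surface, axis=1)
    mask = np.zeros(surface.shape, dtype=bool)
    mask[:, 1:-1] = (d[:, :-1] > 0) & (d[:, 1:] <= 0)
    return mask

# Synthetic stand-in for a stationary formant: a ridge at frequency index 20.
f = np.arange(64)
surface = np.tile(np.exp(-0.5 * ((f - 20) / 3.0) ** 2), (10, 1))
peaks = slice_peaks(surface)
peak_columns = np.nonzero(peaks)[1]
```

Because each slice is examined independently, this computation sees only the frequency profile of a ridge, which is the limitation discussed in the text for steeply sloped formants.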



Eq. 3.5.8 models the situation with F2. The formant pole $P(\omega - mt)$ with time-frequency slope $m$ is smoothed by the 2-D gaussian $\Phi(t,\omega)$ to give $F(t,\omega)$. This will produce a time-frequency ridge in $F(t,\omega)$ that has a roughly constant width, independent of slope $m$, when measured perpendicular to the formant trajectory in the time-frequency plane. However, the width of the ridge in a vertical slice increases with increasing slope; evidently in Figure 2.15, F2 was sufficiently broadened that its spectral peak was completely lost to other effects in the signal, i.e., other formants, noise, the source and transmission channel characteristics (cf. Figure 2 A).




This effect is not an idiosyncrasy of our particular choice of time-frequency energy representation. It is true, for example, of any representation computed with signal windows (e.g., any positive representation, by Thm. A), since if the formant moves enough in frequency over the duration of the window, its spectral representation will be significantly broadened.

One could rethink the design choices for the time-frequency energy representation, trying for better spectral resolution at the expense of our chosen criteria. However, the problem is not there, as a re-examination of Figure 2.15 will show. The F2 ridge is clearly visible in this representation; it looks no more broadened than the stationary F1. This is because we see both dimensions of time and frequency simultaneously, and as the formant ridge broadens in frequency with increasing slope it narrows in time. Its prominence depends on its width perpendicular to its trajectory, which does not change much with slope.

Why then did we confine our peak detection methods to vertical slices? It was the usual quasi-stationary prejudice of thinking of speech analysis in terms of a family of one-dimensional spectral analyses parameterized by time. Just like the energy representation problem, this problem is inherently two-dimensional and should be treated as such.

4.3. Time-frequency ridges - non-directional kernel

The approach we will use for detecting time-frequency ridges will depend on whether we use a directional or a non-directional kernel for the underlying energy representation. If we use a non-directional kernel, the problem is simpler, so we shall address this first. In this case, we begin with a single time-frequency representation at a given time and frequency scale, as in Figure 2.15, and the problem reduces to finding the ridges in this smooth, two-dimensional surface.

How can we find ridges in a smooth, two-dimensional surface? This becomes a problem in differential geometry. As such, let us look at the gradient and curvature vectors of the surface in the neighborhood of a ridge. Figure 4.3 shows them for the time-frequency surface in Figure 2.15 in the neighborhood of the initial steep F2. In particular, the solid vectors are used to depict the direction of the gradient, $\nabla F$, i.e., the local direction of steepest ascent. The dotted vectors depict the direction of greatest downward curvature, $\mathrm{gdc}\,F$, i.e., the local direction in which the surface curves the most downward from the tangent plane.

A precise definition of $\mathrm{gdc}\,F$ is in order. We will use the second derivative as the measure of curvature; this is sometimes called 'unnormalized' curvature. This is used instead of normalized curvature (which has the form $f''/[1 + (f')^2]^{3/2}$ in one dimension) for two reasons. First, it is simpler. Second, unnormalized curvature scales linearly with a change in the amplitude scaling; normalized curvature does not. If we use the former, our ridge computation proves invariant under changes in the amplitude scaling.

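Given this, we define $\mathrm{gdc}\,F$ as the direction vector of the minimum second directional derivative at a given point. More formally, let

$$\mathbf{H}(t,f) = \begin{pmatrix} \dfrac{\partial^2 F}{\partial t^2} & \dfrac{\partial^2 F}{\partial t\,\partial f} \\[2mm] \dfrac{\partial^2 F}{\partial f\,\partial t} & \dfrac{\partial^2 F}{\partial f^2} \end{pmatrix} \qquad (4.3.1)$$

denote the Hessian matrix for $F(t,f)$. Let $\xi$ denote the eigenvector of $\mathbf{H}$ corresponding to the lesser eigenvalue $\kappa$. Then $\mathrm{gdc}\,F = \xi/|\xi|$.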
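At a single point the definition amounts to a 2x2 symmetric eigenproblem. The following minimal sketch assumes the three second partial derivatives are already available as numbers; the function name and the example values are hypothetical.

```python
import numpy as np

def greatest_downward_curvature(f_tt, f_tf, f_ff):
    """Return (kappa, unit vector) for the smaller eigenvalue of the Hessian.

    The vector is given as (time component, frequency component). Its overall
    sign is arbitrary, since an eigenvector defines a direction only up to sign.
    """
    hessian = np.array([[f_tt, f_tf], [f_tf, f_ff]], dtype=float)
    eigvals, eigvecs = np.linalg.eigh(hessian)   # eigenvalues in ascending order
    return eigvals[0], eigvecs[:, 0]

# Example: a surface that is flat along time and curves downward in frequency,
# such as F(t, f) = -f**2, whose Hessian is [[0, 0], [0, -2]].
kappa, direction = greatest_downward_curvature(0.0, 0.0, -2.0)
```

Here `kappa` is the minimum second directional derivative and `direction` points along the frequency axis, perpendicular to the (horizontal) ridge.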

Let us now return to Figure 4.3. As one might expect, the gradient points toward the 
tup the the ridge on each side ofit, but must swing through it as one passes over the 



J4 ,3. Time- frequency rfdjreg - nQttrdJrectianal kernel 
KHZ 



M 



155 



1.5 — 



1.45 









0.025 003 0.033 

SKC 



TT 



0.04 



Figure 4,3. Gradient afjri r.\tTva.ture vectors in the vicinity of the rising F2 in Figure 
2- IS, The solid vectors depict the gradient direction h and the dotted vectors depict 
the direction of getattet downward curvature. (The vector lengths are normalised 
to unity.} 



top. The direction of greatest downward curvature, however, points perpendicular
to the ridge in its entire neighborhood, since a surface will curve downward more
sharply as one moves toward and away from the top of a ridge than if one moves
along it. Note that the two kinds of vectors will become perpendicular precisely on
the top of the ridge.




We define the ridge top as the locus of points that satisfy

$$\nabla F \cdot \mathrm{gdc}\,F = 0 \quad \text{and} \quad \kappa < 0, \qquad (4.3.2)$$

where $\kappa$ is the minimum second directional derivative. The inner product of these
vectors is zero precisely when they are perpendicular, and $\kappa < 0$ insures that the
point is a ridge top and not a trough bottom.

We now show this definition is equivalent to moving along lines of curvature on
$F(t,f)$ corresponding to the greatest downward curvature and noting passage through
a peak on that surface. This gives an intuitively simple interpretation of a ridge
top, and shows that $(\mathrm{gdc}\,F)^{\perp}$ essentially provides the local ridge direction.

Let $g : \mathbb{R} \to \mathbb{R}^2$ be a parameterized, differentiable curve with
$g'(s) = \mathrm{gdc}\,F(g(s))$. In other words, $g$ traces out a curve in the time-frequency
plane that is always tangent to the direction of maximum downward curvature. When
$F \circ g$ goes through a peak, $\frac{d}{ds} F(g(s)) = 0$. By the chain rule, this occurs
precisely where $\nabla F \cdot g'(s) = \nabla F \cdot \mathrm{gdc}\,F = 0$. If $\kappa < 0$, the curve
goes through a maximum.† But this is just our ridge top definition, Eq. 4.3.2, as desired.

The inner product in Eq. 4.3.2 is easy to compute for each point on these
time-frequency surfaces (one only needs the first and second derivatives of the
surface, which are simple to compute for such a smooth surface). Since this
quantity may vanish in between sample points in a digital implementation, we detect
zero-crossings between adjacent sample points.
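A minimal sketch of this zero-crossing test is given below. It is an illustration under several assumptions rather than the implementation used here: the surface is an array indexed `[time, frequency]`, derivatives come from `numpy.gradient`, the amplitude floor `min_amp` is a hypothetical stand-in for the signal-to-noise threshold, and the sign ambiguity of the eigenvector is resolved by aligning each pair of neighboring samples before comparing signs.

```python
import numpy as np

def ridge_tops(F, dt=1.0, df=1.0, min_amp=0.0):
    """Boolean mask of samples adjacent to a zero-crossing of grad(F).gdc(F)
    at which the lesser Hessian eigenvalue kappa is negative (Eq. 4.3.2)."""
    Ft, Ff = np.gradient(F, dt, df)
    Ftt, Ftf = np.gradient(Ft, dt, df)
    _, Fff = np.gradient(Ff, dt, df)
    H = np.stack([np.stack([Ftt, Ftf], -1), np.stack([Ftf, Fff], -1)], -2)
    eigenvalues, eigenvectors = np.linalg.eigh(H)
    kappa, gdc = eigenvalues[..., 0], eigenvectors[..., :, 0]
    p = Ft * gdc[..., 0] + Ff * gdc[..., 1]  # inner product of Eq. 4.3.2

    mask = np.zeros(F.shape, dtype=bool)
    for axis in (0, 1):  # compare neighbors along time, then along frequency
        lo = [slice(None), slice(None)]
        hi = [slice(None), slice(None)]
        lo[axis], hi[axis] = slice(0, -1), slice(1, None)
        lo, hi = tuple(lo), tuple(hi)
        # Eigenvectors are defined only up to sign; align each neighbor pair.
        dot = np.sum(gdc[lo] * gdc[hi], axis=-1)
        align = np.where(dot < 0, -1.0, 1.0)
        crossing = p[lo] * p[hi] * align <= 0
        valid = ((kappa[lo] < 0) & (kappa[hi] < 0)
                 & (F[lo] >= min_amp) & (F[hi] >= min_amp))
        mask[lo] |= crossing & valid
    return mask

# Hypothetical test surface: a rising ridge along f = 10 + 0.5 t.
tt, ff = np.meshgrid(np.arange(40.0), np.arange(50.0), indexing="ij")
F = np.exp(-0.5 * ((ff - (10.0 + 0.5 * tt)) / 3.0) ** 2)
mask = ridge_tops(F, min_amp=0.7)
```

The amplitude floor matters in this sketch: far from the ridge both eigenvalues are close to zero, so the sign of the inner product there is dominated by numerical error.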

Figure 4.4 shows the zero crossings in this quantity for the time-frequency energy
surface in Figure 2.15. Note that the steep formant peaks are now as well traced

† This assumes $|g''(s)|$ is negligible in $(F \circ g)''(s) = g'(s) \cdot H g'(s) + \nabla F \cdot g''(s)$, where it equals the first
term.



Figure 4.4. Two-dimensional ridge computation applied to the time-frequency
energy surface in Figure 2.15. The contours are those points where the gradient
direction and direction of greatest downward curvature are perpendicular. This
computation captures the steep time-frequency ridges, due to rapid formant motion,
as well as the more horizontal ones.



as the stationary ones by this ridge top computation. The only thresholding
performed here is the removal of points below the signal-to-noise ratio of the analysis.
Thus, fairly low amplitude structure can appear in addition to the significant
time-frequency ridges. We will examine in Section 4.6 how to deal with such clutter.



A few pertinent details have not yet been mentioned. First, to perform this
computation, an aspect ratio has to be chosen between time and frequency, since it is not
invariant under different relative scaling of time and frequency. The choice is
natural; we use the scaling inherited from the energy representation: let
$\tilde f = (\sigma_t/\sigma_\omega)\,\omega$. Thus, we perform our computations in the new
coordinates, $(t, \tilde f)$.

Second, very high spatial frequencies have been removed from the energy
representation already. Very low spatial frequencies also appear in the vertical direction,
due to amplitude variations and formant motion. We find better results when these
are also removed by filtering; we thus use a smoothed and flattened energy surface
for the ridge computation.

4.4. Time-frequency ridges - directional kernel

A second approach to the problem of identifying time-frequency energy ridges uses
directional kernels. Let $F(t,f;\theta)$ be a family of time-frequency representations of
the class defined by the kernel in Eq. 2 .$.5, where $\theta$ gives the preferred direction of
the transform (i.e., the kernel orientation), and the other free parameters, $\sigma_1$ and
$\sigma_2$, are fixed. We would expect that, in the vicinity of a time-frequency ridge and for fixed
$t$ and $f$, $F(t,f;\theta)$ would be maximum when $\theta$ equalled the local ridge direction $\theta_0$;
in other words, when the transform's orientation is tuned to the local direction of
the energy ridge. We would also expect that $F(t(s),f(s);\theta_0)$ would be maximum
at the ridge top, where $(t(s),f(s))$ is a curve that crosses the ridge perpendicular
to its trajectory. The first case corresponds to a maximum under rotation of the
kernel; the second case corresponds to a maximum under translation of the kernel
along the minor axis of its concentration ellipse (see Figure 4.5).

The locus of points where these two maxima coincide defines a curve in the
time-frequency plane, which we can take as our ridge top definition. That is, we seek the
points that satisfy both

$$\frac{\partial F}{\partial \theta}(t,f;\theta) = 0 \qquad (4.4.1a)$$



Figure 4.5. Two conditions for ridge detection: (a) local maximum under kernel
rotation, and (b) local maximum under kernel translation along minor axis.



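and

$$\nabla F(t,f;\theta) \cdot \mathbf{n}_\theta = 0, \qquad (4.4.1b)$$

where $\mathbf{n}_\theta$ denotes the unit vector along the minor axis of the kernel.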



This computation can be implemented by calculating $\partial F/\partial t$, $\partial F/\partial f$, and
$\partial F/\partial \theta$ on a sufficiently fine grid of samples of $(t,f,\theta)$, and then finding the
simultaneous zero-crossings in the left-hand sides of Eq. 4.4.1a and Eq. 4.4.1b. (The signs
of the zero-crossings have to be examined to insure that we have maxima and not minima.)

We have yet to specify the scale parameters $\sigma_1$ and $\sigma_2$. Alternatively, we can
specify $\sigma_2$ and $r = \sigma_1/\sigma_2$. We can interpret $\sigma_2$ as the size parameter and $r$ as an
eccentricity parameter, since the greater the value of $r$, the greater the eccentricity
of the concentration ellipse for the kernel (when holding $\sigma_2$ constant).

The choice of $r$ depends on a tradeoff. Clearly, as $r$ increases, time-frequency locality
is sacrificed. In particular, bends in the time-frequency trajectory of an energy ridge
are poorly resolved with larger values of $r$.

On the other hand, larger values of $r$ have an advantage in separating intersecting
energy ridges, since the larger values of $r$ give better selectivity to a particular
orientation. We can quantify this selectivity as follows.

Consider the response of the transform at a frequency $f_0$ to a complex exponential
of frequency $f_0$. The value is independent of $f_0$ and equals the value of $F_x(0,0;\theta,r)$
when $x(t) = 1$ (i.e., $f_0 = 0$). We can therefore define a tuning curve $T(\theta,r) =
F_1(0,0;\theta,r)$ that indicates the selectivity of the transform kernel to different values
of the orientation and eccentricity parameters.

It is straight-forward to show that 

T(#, r) K — * (4.4.4) 

In Figure 4.6 this tuning curve is plotted as a function of $\theta$ for several values of $r$.

Even greater orientation selectivity can be obtained if we modify this ridge top
computation. The idea is simple: instead of maximizing the energy, $F(t,f;\theta)$, for
various $\theta$ in Eq. 4.4.1a, we can maximize a more directionally selective measure, such


Figure 4.6. Tuning curves showing directional selectivity of gaussian transform
kernels.



as amount of curvature. In particular, we minimize the second directional
derivative perpendicular to the kernel orientation. But this is equivalent to maximizing
the energy of the transform that uses the modified kernel
$\tilde\phi(t,f) = -\frac{\partial^2}{\partial f^2}\phi(t,f)$;
in other words, we use a modified Gaussian kernel in the computation specified by
Eqs. 4.4.1a,b. This new kernel has a central 'excitatory' region with 'inhibitory'
flanks that give greater orientation selectivity. See Figure 4.7.
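The shape of such a modified kernel can be sketched as follows. This is an illustration only: it assumes an unnormalized, oriented 2-D Gaussian with major-axis scale `sigma1` and minor-axis scale `sigma2`, and takes the negative second derivative along the minor axis analytically. The function name and the parameter values are hypothetical, and the normalization of the actual transform kernel is not reproduced.

```python
import numpy as np

def modified_kernel(t, f, sigma1, sigma2, theta):
    """Negative second derivative, across the kernel orientation, of an
    oriented Gaussian whose major axis makes angle theta with the time axis."""
    c, s = np.cos(theta), np.sin(theta)
    u = c * t + s * f    # coordinate along the kernel orientation (major axis)
    v = -s * t + c * f   # coordinate along the minor axis
    gaussian = np.exp(-0.5 * ((u / sigma1) ** 2 + (v / sigma2) ** 2))
    # -d^2/dv^2 exp(-v^2 / (2 sigma2^2)) = (1 - v^2/sigma2^2) / sigma2^2 * exp(...)
    return (1.0 - (v / sigma2) ** 2) / sigma2 ** 2 * gaussian

# Example with eccentricity r = sigma1 / sigma2 = 2 and a horizontal orientation.
center = modified_kernel(0.0, 0.0, sigma1=2.0, sigma2=1.0, theta=0.0)
flank = modified_kernel(0.0, 2.0, sigma1=2.0, sigma2=1.0, theta=0.0)
along = modified_kernel(3.0, 0.0, sigma1=2.0, sigma2=1.0, theta=0.0)
```

The kernel is positive at the center and along its orientation, and negative on the flanks beyond one `sigma2` across the orientation, which is the 'excitatory' center and 'inhibitory' flank structure described above.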

The tuning curve for this modified kernel has the form



r(*,r) oc caBHV*[$>r). 



{*■<■*) 



Figure 4.7. Transform kernel $\tilde\phi(t,f) = -\frac{\partial^2}{\partial f^2}\phi(t,f)$, where $\phi(t,f)$ is a 2-D gaussian.
The new kernel has a central 'excitatory' region with 'inhibitory' flanks that give
greater orientation selectivity.



In Figure 4.8 this tuning curve is plotted as a function of $\theta$ for several values of $r$.
These indeed show greater selectivity than the corresponding plots in Figure 4.6.

It turns out that this computation is a generalization of the method in Section 4.3.
In particular, if $r = 1$, then the two computations are identical; i.e., those points at
which the maximum downward curvature is perpendicular to the gradient direction
are identical to those points where the minimum second derivative is parallel to a
direction of zero slope.

We therefore see that this section is a generalization of the previous section. When
$r = 1$, optimal localization in time-frequency results. As $r$ is increased, some of this
locality is sacrificed for improved orientation selectivity. Thus, a non-directional
kernel will give better results when there is only one ridge in the region, while a


Figure 4.8. Tuning curves showing directional selectivity of transform kernels of
the form in Figure 4.7.



directional kernel can give better results when two ridges cross.



Let us examine these results on our example utterance from Section 2.9. For voiced
speech, we choose $\sigma_2$ to match the pitch period, and we let $r > 1$. Then the
pitch will be suppressed in each of the $F(t,\omega;\theta)$, using the results of Chapter 3.
In Figure 4.9, we show the ridge top analysis on our utterance using the kernel of
Figure 4.7 with $r = 2$ and $r = 3$. The case $r = 1$ was shown in Figure 4.4. We see
that a less directional kernel (a smaller value of $r$) gives better performance in the
neighborhood of isolated formants, while a more directional kernel (a larger value of
$r$) gives better performance in regions where two formants 'cross' (see Kuhn [1975]
for a discussion on the 'crossing' of formants in natural speech).

4.5. Signal detection and ridge identification

The preceding sections have been based on heuristic arguments. Can ridge
identification be formulated as a problem in optimal signal detection? We examine this
question in this section. Let us begin by making some particularly simple
assumptions for ease of argument. We assume that the received 2-D signal representation
$F(t,\omega)$ consists of a 2-D deterministic function $S(t,\omega;\gamma(t))$, which depends on an
unknown continuous function $\gamma(t)$, plus additive white 2-D Gaussian noise. The
problem is to estimate $\gamma(t)$, which models the path of an energy concentration in
time-frequency. We further simplify the problem by assuming that $S(t,\omega)$, which
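models the energy ridge, has the form

$$S(t,\omega;\gamma(t)) = G(t,\omega) ** \left[\sqrt{1 + \gamma'(t)^2}\;\delta(\omega - \gamma(t))\right]; \qquad (4.5.1)$$

in other words, it is a 2-D smoothed (i.e., broadened) curve (the square root factor
normalizes the impulse for a unit step in arc length).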

In a straightforward 2-D generalization of the derivation of a matched filter [see
Van Trees 1968], the log likelihood of $\gamma(t)$ is proportional to

$$\Lambda[\gamma(t)] = 2 \iint F(t,\omega)\, S(t,\omega;\gamma(t))\, dt\, d\omega - \iint |S(t,\omega;\gamma(t))|^2\, dt\, d\omega. \qquad (4.5.2)$$

Substituting Eq. 4.5.1 into Eq. 4.5.2 and changing the order of integration gives

$$\Lambda[\gamma(t)] = 2 \int \sqrt{1 + \gamma'(t)^2}\; \tilde F(t,\gamma(t))\, dt - \iint |S(t,\omega;\gamma(t))|^2\, dt\, d\omega, \qquad (4.5.3)$$

where $\tilde F = F ** G$. The first term is essentially a 2-D matched filter in which
the convolution $F ** G$ is matched to the signal shape. The second term takes


Figure 4.9. Ridge top analysis of /whi/ using the directional kernel of Figure 4.7.
(a) r = 2. (b) r = 3. The more directional kernels give better performance where
ridges intersect, but worse performance at sharp bends.

into account the energy of the deterministic signal. The path $\gamma(t)$ that maximizes
Eq. 4.5.3 is the maximum likelihood estimate.
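To make the role of the first term in Eq. 4.5.3 concrete, the sketch below scores candidate paths against an already smoothed surface. It is an assumption-laden illustration: the surface array stands in for $F ** G$, the path is sampled once per time frame, frequency is linearly interpolated, and the second term is approximated by a hypothetical constant `energy_per_unit_length` times the arc length (zero by default).

```python
import numpy as np

def path_score(F_smooth, gamma, dt=1.0, df=1.0, energy_per_unit_length=0.0):
    """Approximate Eq. 4.5.3 for a path gamma[k] (frequency at time sample k)."""
    gamma = np.asarray(gamma, dtype=float)
    n_t, n_f = F_smooth.shape
    # Linear interpolation of the surface at the (fractional) path frequency.
    idx = np.clip(gamma / df, 0, n_f - 1)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, n_f - 1)
    frac = idx - lo
    rows = np.arange(n_t)
    samples = (1.0 - frac) * F_smooth[rows, lo] + frac * F_smooth[rows, hi]
    # Arc-length weighting sqrt(1 + gamma'(t)^2).
    arc = np.sqrt(1.0 + np.gradient(gamma, dt) ** 2)
    matched = 2.0 * np.sum(arc * samples) * dt
    penalty = energy_per_unit_length * np.sum(arc) * dt
    return matched - penalty

# Hypothetical smoothed surface with a ridge along f = 10 + 0.5 t.
tt, ff = np.meshgrid(np.arange(40.0), np.arange(50.0), indexing="ij")
surface = np.exp(-0.5 * ((ff - (10.0 + 0.5 * tt)) / 3.0) ** 2)
true_path = 10.0 + 0.5 * np.arange(40.0)
score_true = path_score(surface, true_path)
score_offset = path_score(surface, true_path + 4.0)
```

A path lying on the ridge scores higher than the same path displaced by four frequency bins. Note that the slope, and hence the arc-length weight, depends on the relative scaling chosen for time and frequency.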

Solving Eq. 4.5.3 for the best path is difficult. In particular, the second term is hard
to evaluate (although it is proportional to the arc length of $\gamma(t)$ when it is sufficiently
smooth). However, an analysis-by-synthesis procedure could, in principle, be used
to compute it numerically. Since we have assumed $\gamma(t)$ is continuous, this becomes
a global optimization over $t$ and $\omega$. This is rather like one-pole analysis-by-synthesis
with a continuity condition imposed on the pole trajectory.

There is a fundamental problem with this approach, similar to the problem with the
pole-fitting approach discussed in Section 4.1. Because of the non-locality of the
optimization, errors at one point can propagate throughout the solution path at this
very first stage of the analysis. If the signal were well modeled by Eq. 4.5.3 and the
noise well modeled by additive, white Gaussian noise, then this would nevertheless
be the best we could do. Realistically, this is not the case. In particular, the 'noise'
could include a second ridge, one that we shouldn't treat as noise, but as something
to detect also. The detection scheme, as formulated, is too global. Instead, we need
to make it more local in the time-frequency plane.

Consider a small element $\Delta s$ of arc length of the curve $\gamma(t)$, which we can rotate
and translate in the $t$-$\omega$ plane. If we hold its position constant, then for sufficiently
small $\Delta s$, Eq. 4.5.3 will be maximized for that element if it is oriented perpendicular
to the direction of greatest downward curvature. If the element's orientation is held
constant, Eq. 4.5.3 will be maximized for that element if one translates it in the
direction of the gradient. Together these imply that elements aligned on the ridge
tops defined by Eq. 4.3.2 will locally maximize Eq. 4.5.3, in the sense that further
maximization requires moving along the ridge. These considerations show that
the ridge operator of Section 4.3 provides a kind of local solution to the detection
problem formulated here.

4.6. Continuity and grouping

We have seen that the ridge detection methods of the previous sections produce
piecewise continuous contours. This follows formally from the Implicit Function
Theorem; in particular, the zeroes of a continuously differentiable function
$f : \mathbb{R}^2 \to \mathbb{R}$ must form continuous contours in $\mathbb{R}^2$. This continuity is a desirable property
of the description since it reflects a constraint on the underlying acoustic events
that is nearly always valid: loosely, that their spectral content varies (piecewise)
continuously as a function of time. For example, formant motion is so constrained.
We explore several ramifications of continuity in this section.

First, continuity helps to solve a practical problem in descriptions of this kind. The
ridge description, as it stands, can be cluttered with low amplitude peaks unrelated
to significant phonetic events. If we try to discard this unwanted structure by setting
a threshold, we would have to keep it fairly low, otherwise we could throw out the
baby with the bath water, breaking important contours into fragments. Continuity
lets us use thresholding with hysteresis, which is often used in such cases [cf. Canny
1983]. The idea is to set two thresholds. Points below the lower threshold are first
discarded. Points that are above the higher threshold are retained, as are any points
between the two thresholds, provided they lie on a contour that crosses the higher
threshold. The result is that insignificant points are discarded without fragmenting
more important contours. The technique can be quite effective; Figure 4.10 shows
an example.
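The two-threshold rule just described can be sketched as follows. The data layout is an assumption made for illustration: each contour is taken to be an ordered list of `(time, frequency, amplitude)` points, and the function name and threshold values are hypothetical.

```python
def hysteresis_threshold(contours, low, high):
    """Apply two-threshold (hysteresis) selection to a list of contours.

    Each contour is an ordered list of (time, frequency, amplitude) points.
    """
    kept = []
    for contour in contours:
        # Step 1: discard points below the lower threshold. This may split a
        # contour into several connected runs of surviving points.
        runs, run = [], []
        for point in contour:
            if point[2] >= low:
                run.append(point)
            elif run:
                runs.append(run)
                run = []
        if run:
            runs.append(run)
        # Step 2: keep a run only if it reaches the higher threshold somewhere;
        # its points between the two thresholds are then retained with it.
        kept.extend(r for r in runs if any(p[2] >= high for p in r))
    return kept

# Hypothetical contours: a strong one, a weak one, and one broken by a dropout.
example = [
    [(0, 10, 0.2), (1, 10, 0.6), (2, 11, 0.9), (3, 11, 0.5)],
    [(0, 30, 0.4), (1, 30, 0.45)],
    [(5, 20, 0.95), (6, 20, 0.1), (7, 20, 0.5)],
]
result = hysteresis_threshold(example, low=0.3, high=0.8)
```

In this example the first contour keeps its moderate-amplitude points because it also contains a strong point, while the second contour, which never reaches the higher threshold, is removed entirely.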



Figure 4.10. Hysteresis thresholding applied to utterance /wioi/ of Section 2.6.
(a) Two-dimensional ridge tops. Amplitude of the ridge top is depicted by the
width of the contour. (b) Hysteresis thresholding of (a). This removes isolated,
low amplitude points without fragmenting the more significant contours.

One may argue that any kind of thresholding is a mistake, since unrecoverable errors
can be made. Instead, one should simply carry along the relative amplitudes and
strengths of the various points in the descriptions, and subsequent processing can
take these weights into account. This is, in principle, safer, but practically it is much
harder to think about processing a cluttered, weighted description than one that
has been first cleaned up. So that the problem does not become too unwieldy at
this stage, it is best for now to proceed with a cleaned up description.

Continuity plays an important role in another problem: labelling. Our goal is
to eventually be able to label the points in the description with their acoustic
correlates, e.g., formant identification. This problem would be greatly simplified
if a whole contour could receive a single label. For example, suppose points along
the two contours in Figure 4.11 are competing for labelling as F2. If the points are
sampled every 5 msec, then the points in a 50 msec stretch can be labelled in $2^{10}$
different ways. If each of the contours, however, is known to have a single acoustic
correlate, then there are only two possible labellings.
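The count in this example is simple arithmetic, sketched below. The function name and arguments are hypothetical; the only assumption is that, at each sample, exactly one of the competing contours is chosen to carry the label.

```python
def count_labellings(stretch_msec, step_msec, n_contours=2):
    """Return (pointwise, whole_contour) counts of possible labellings.

    Pointwise: an independent choice among the contours at every sample.
    Whole contour: a single choice for the entire stretch.
    """
    n_samples = stretch_msec // step_msec
    return n_contours ** n_samples, n_contours

pointwise, whole = count_labellings(stretch_msec=50, step_msec=5)
```

With 10 samples in the stretch, the pointwise count is $2^{10} = 1024$, against only 2 when each contour receives a single label.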

This is a simple point, but it is almost universally overlooked. The usual approach
has been to label individual points in a spectrum, and then either ignore continuity
altogether, or use it to narrow the range of candidate labellings after the fact. The
latter approach leads to a combinatorial explosion of possible labellings. Algorithms
such as dynamic programming can be used to make this approach more manageable,
but then the effect of even a single error can be catastrophic. A more direct approach
is to first identify stretches of contour that will receive a unique label, with each
deemed to have a single acoustic correlate.

How can we identify such "atomic" contours? Ideally, our initial analysis would only



Figure 4.11. Two contours competing for labelling as F2. (a) One of $2^{10}$ possible
labellings of a 50 msec stretch when a new label can be assigned every 5 msec. (b)
One of two labellings when whole contours receive a single label.



return such contours. Acoustic events would never be merged into a single contour,
but would always be resolved as separate. I do not believe such a "perfect" analysis
is possible. It is evidently possible to fool our auditory system on this account.
Consider the spectrum of an /i/ in Figure 4.12a. By low-pass filtering, the spectrum
can be tilted to appear as in Figure 4.12b. This will be perceived as an /u/; the F1
of the /i/ is taken as both F1 and F2. Conversely, an /u/ can be high-pass filtered
to sound like an /i/, with F1+F2 being taken as F1.

Listeners seldom make these kinds of mistakes with more natural utterances altered
by this kind of filtering. This is because they hear them in context, with continuity
being an important contextual cue. For example, consider Figure 4.13, which shows
the spectrogram of /wi/. The /i/ in Figure 4.12 was taken from this utterance. If
the entire /wi/ is low-pass filtered in the manner of Figure 4.12, it is perceived as



Figure 4.12. Turning an /i/ into an /u/. (a) Short-time spectrum of an /i/.
(b) Low-pass filtered /i/. This will be perceived as an /u/. In other words, F1 is
perceived as F1+F2.



/wi/, and not as /wu/. Similarly, a high-pass filtered /yu/ will not sound like it
ends in /i/.



There are two points to be learned from these examples. The first is that it is
probably not possible to always separate distinct acoustic correlates of nearby energy
concentrations locally, i.e., they can be merged if heard in isolation. The second
point is that more global constraints, such as continuity, can resolve these mergers.



Figure 4.13. Spectrogram of /wi/. When this utterance is low-pass filtered as
in Figure 4.12, it is still perceived as /wi/. Continuity of the formants allows the
correct perception.



The ridge description will represent sufficiently close formants with a single ridge, as
in Figure 4.14. When the formants merge, one of the contours terminates, and the
other continues on. When the formants split, a new contour appears, while the old
contour continues on. Evidently, some contours can change their label along their
length. For example, the contour in Figure 4.14 that begins as F1+F2
splits into F1 and F2. Obviously, we cannot always label whole contours with a single
label.



We can, however, label portions of contours between splits and mergers with a
single label. Said differently, if we identify the locations of splits and mergers,



Figure 4.14. Merged formants. (a) Wideband spectrogram of utterance "Why
am". (b) Ridge tops. When F1 and F2 approximate, their ridges merge.



we can break the contours into a set of "atomic" contours, in the sense that each
contour will receive a single labelling. Since mergers are sparsely distributed in
time-frequency, we will still have a small, manageable set of contours.




The idea, then, is to augment our representation to include the locations of splits,
mergers, and crossings of contours. Identifying these junctions will serve two
purposes. First, contour segments away from them can receive single labels along their
length. Second, the junction itself can embody continuity constraints, since the
junctions must be consistently labelled. For example, if two contours enter a
junction and one leaves it, we may label the exiting contour with the union of the labels
of the entering contours.

This is somewhat reminiscent of the junction labelling problem in the blocks world.
Perhaps an efficient algorithm to propagate these constraints can be found for
formant labelling as Waltz [1975] found for the blocks world. The problem here is
greatly complicated by the fact that there can be many kinds of errors, e.g., a
formant can be "missing". Further, other factors such as spectral balance must be
taken into account. We will not attempt any labelling here. Instead, we provide a
description of the signal that is a reasonable step toward that goal.

Provided the ridge description is not too cluttered, which is the rule once low
amplitude contours have been removed, the identification of contour junctions is
relatively easy. In fact, using the proximity of contour endpoints to other contours
is a simple method. Two nearby endpoints define a two point junction. Three
nearby endpoints or a single endpoint near the body of another contour define a
three point junction, and so on. Figure 4.15a shows junctions identified by such
proximity rules. Contours that both enter and leave a junction are broken there,
while two point junctions can be bridged provided that simple "good continuation"
rules are satisfied. The result is a set of contours that are likely to have unique
labels of their acoustic correlates along their length. Figure 4.15b shows points
where contours are broken based on these junctions.
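One way such proximity rules might be sketched is shown below. This is not the procedure used to produce Figure 4.15; it is a hypothetical illustration in which each contour is an ordered list of `(time, frequency)` points in commensurate units, `radius` is an assumed proximity threshold, and an endpoint counts as one branch while the body of a contour counts as two.

```python
import math

def find_junctions(contours, radius):
    """Map each set of participating contour indices to a junction order.

    An endpoint near another contour's endpoint adds one branch; an endpoint
    near the body of another contour adds two (that contour passes through).
    """
    junctions = {}
    for i, contour in enumerate(contours):
        for end in (contour[0], contour[-1]):
            end_hits, body_hits = set(), set()
            for j, other in enumerate(contours):
                if j == i:
                    continue
                near = [k for k, point in enumerate(other)
                        if math.dist(end, point) <= radius]
                if not near:
                    continue
                if 0 in near or len(other) - 1 in near:
                    end_hits.add(j)   # met at one of its endpoints
                else:
                    body_hits.add(j)  # met somewhere along its body
            if end_hits or body_hits:
                members = frozenset({i} | end_hits | body_hits)
                order = 1 + len(end_hits) + 2 * len(body_hits)
                junctions[members] = max(order, junctions.get(members, 0))
    return junctions

# Hypothetical contours: 0 and 1 meet end to end; 3 ends on the body of 2.
example_contours = [
    [(0, 0), (1, 0), (2, 0)],
    [(2.5, 0.2), (3.5, 1), (4.5, 2)],
    [(10, 0), (10, 1), (10, 2), (10, 3), (10, 4)],
    [(8, 2), (9, 2), (9.5, 2)],
]
found = find_junctions(example_contours, radius=1.0)
```

Here contours 0 and 1 form a two point junction that could be bridged, while contour 3 ending on the body of contour 2 forms a three point junction at which contour 2 would be broken.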



Figure 4.15. Contour junctions located. (a) Ridge tops of /whi/ with junctions
identified by simple proximity rules. (b) Dots show points where contours are broken
based on these junctions.

4.7. A perspective

We have shown that the above analysis in some circumstances can produce a more
reasonable schematization of the speech signal than, for example, LPC analysis. We
will give many more examples of this analysis in the next chapter. Does this mean
that the ridge analysis is uniformly better than LPC analysis in speech applications?
The answer is no. The simplicity and speed of the LPC algorithms make them
attractive for many applications. Further, such pole-fitting models do work well
in many cases. Since they embody additional constraints compared to the raw
ridge analysis, they will usually not make the 'mistake' of merging nearby formants
together. Further, insignificant peaks usually do not affect the pole placements.
This means that in clean, unnasalized, quasi-stationary male speech LPC analysis
can be quite good. In such cases, the ridge analysis may nevertheless merge nearby
formants together and may include additional ridges, making that analysis appear
inferior to the LPC analysis.

This probably means that the ridge analysis will offer no improvement in simple
speech engineering applications over the widespread LPC methods. Frankly, the power
and importance of the ideas presented here comes only when one asks the question:
What methods will be appropriate for speech analysis in general, natural settings?
Under such circumstances, the transmission channel will often be imperfect and
varying (e.g., walking down a hallway with open doors), there can be environmental
sounds and nasalization present, and there can be significant non-stationarity. In
these cases, the very constraints (i.e., all-pole, quasi-stationary model with a fixed
number of poles) that make the LPC technique work so well for 'clean' speech can
cause it to fail in these new circumstances, producing bizarre pole positionings. On
the other hand, the ridge analysis, a more conservative technique that makes no such
assumptions, will still produce a reasonable schematization of the time-frequency
surface. A simple demonstration of these ideas is given in Section 5.6 below. The
key idea is that strong commitments to the origin of the signal are not made at the
level of the schematic spectrogram. It is only after the ridge tops, and undoubtedly
other features such as time-frequency edges, temporal discontinuities, and spectral
balance information, have been made explicit that articulatory constraints and such
will be brought to bear in this more general, least commitment approach.



Chapter 5

A Catalog of Examples 



In this chapter we will apply the methods of the previous chapters to a variety of
examples. This will help us evaluate the strong points as well as the shortcomings
of the ideas presented. The ultimate test can come only when these ideas are
applied in a recognition scheme. This, however, has not been realized because of
the many different components that need to be added, as indicated earlier. At this
point, evaluation must be based on any intuitive appeal of the ideas, and on the
performance on various examples. Given that the goal is to essentially 'schematize'
the information seen in (the sonorant regions of) a spectrogram, an obvious test
is to see how reasonable the computed description looks when compared to the
spectrogram. Given that previous approaches perform poorly in specific contexts
(see Figure 4.1), clear improvements will be apparent.

This situation is similar to edge detection in image analysis. The typical way to
evaluate an edge finder is to look at its output compared to the image and ask how
good it looks. Perhaps a better test would be to ask how useful an edge finder
output is, say, when applied to some scheme for finding surface discontinuities or
stereo depth. But such a test requires confidence in the validity of the subsequent
processing, since a bad application of a good idea can perform more poorly than a
good application of a bad idea.

In Section 5.1, we will look at some general example sentences. In the following
sections, we examine several traditional problem categories in speech analysis: in
Section 5.2, we look at semivowels and glides; in Section 5.3, nasalized vowels; in
Section 5.4, consonant-vowel transitions; in Section 5.5, female speech. In Section
5.6, we look at some examples of the effects of different transmission channels on
the analysis.

5.1. Some general examples

The first four figures of this chapter show the sentences "May we all learn a yellow
lion roar.", "Are we winning yet?", "We were away a year ago." and "Why am
I eager?" spoken by adult males. These sentences were chosen because of their
high proportion of sonorant regions and their variety of formant motion. We show
wideband spectrograms and the 'ridge' analysis of the previous chapter for each
of these utterances. First notice the generally good agreement between the
time-frequency ridges seen in the spectrograms and those computed by the ridge analysis;
the latter description is a reasonable partial 'sketch' of the former. This is true even
in the steeper formant regions, such as the various /w/'s and /j/'s in these examples
and at the velar pinch in Figure 5.4 at .75 seconds.

It is important to emphasize that these are not formant tracks, but ridge locations
in the time-frequency surface. For example, when two formants come close enough
to merge, as in the /wi/ in Figure 5.1 (between .2 and .3 seconds and about 2100
Hz) or a portion of the /r/ in Figure 5.4 (between .85 and .9 seconds and 2000 Hz),
only a single ridge is found. (The analysis notes by solid dots the locations that
contours should be broken because of possible mergers (cf. Figure 4.15), which can
aid in subsequent labelling of the contours.)

There are also ridges present that are not due to the oral formants. For example, the
ridge in Figure 5.4 between .15 sec and .55 sec and at about 200 Hz is attributed to
nasalization from the /m/. Viewed as a formant tracker this is a failure, but viewed
as a ridge detector, this is a success. The nasal resonance is strongly present in the
signal in this region and is correctly identified by the analysis. It is properly left
to subsequent processing to sort out which ridges are due to formants and which
are due to other sources. This is quite different from the LPC analysis, where the
presence of nasalization often causes sporadic and bizarre placement of the pole
locations (Figure 4.1). In that case, subsequent processing would have difficulty
sorting out the situation.

Finally, there are various missing formants. This is particularly true for F3 when F2
is quite low, as in the /w/ in Figure 5.1. In these circumstances, F3 is driven down
by the tail of F2, and is not really visible in the spectrograms either. We know
where F3 is by context, but its time-frequency ridge has essentially been driven into
the noise.



Figure 5.1. "May we all learn a yellow lion roar?"

Figure 5.2. "Are we winning yet?"

Figure 5.3. "We were away a year ago."

Figure 5.4. "Why am I eager?"

5.2. Semi-vowels and glides

In this section we show examples of /w/'s, /j/'s, /r/'s and /l/'s. The /w/'s and
/j/'s are syllable initial in the context of /wi/ and /ju/ in Figure 5.5 and Figure
5.6, respectively. A range of speech rates from slow to rapid is shown that gives a
range of F2 formant slopes from gradual to steep. Note the ridge analysis is fairly
insensitive to this parameter.

The /l/'s in Figure 5.7 are syllable initial, with one example for each of the cardinal
vowels, /i/, /ae/, /a/, and /u/. The /r/'s in Figure 5.8 are in the context V/r/V,
where V ranges over /i/, /ae/, /a/, and /u/. These too show some rapid formant
motion that is well captured.

5.3. Nasalized vowels 

Figure 5.9 shows syllable initial nasalized vowels in the context V/n/. The vowels
range over /i/, /ae/, /a/, and /u/. The main feature of this analysis is that
additional ridges are introduced due to the nasal 'formants'. As mentioned earlier, this
contrasts with the pole-fitting methods, which produce erratic results in nasalized
vowels (Figure 4.1).



Figure 5.5. /w/'s at various speech rates.

Figure 5.6. Syllable initial /ju/'s at various speech rates.

Figure 5.7. Syllable initial /l/'s.

Figure 5.8. /r/'s in various vowel contexts. Panels: /iri/, /aerae/, /ara/, /uru/.

Figure 5.9. Nasalized vowels.

5.4. Consonant-vowel transitions

In this section we show examples of consonant-vowel transitions. Figure 5.10
through Figure 5.12 show syllable initial consonant-vowel transitions. The
consonants range over the voiced stops /b/, /d/, and /g/ and the vowels range over
/i/, /ae/, /a/, and /u/. The analysis is shown only after the consonantal burst since
the ridge analysis is inappropriate and peculiar in the burst region. The bursts were
located by hand in these examples. Figure 5.13 shows more rapid formant motion
with the examples /bi/ in the context /ubi/ and /du/ in the context /idu/.

The ridge analysis brings out formant motions, consistent with the locus theory of
consonant perception. This theory states that one of the cues to the perception
of consonants is the trajectories of the formants at the transitions [Liberman, et
al. 1954]. For example, in many vowel contexts for adult males, F2 will have a
trajectory out of the consonant that has a locus near about 1200 Hz for labials (e.g.,
/b/), about 1800 Hz for alveolars (e.g., /d/), and above 2000 Hz for velars (e.g.,
/g/). This cue is used in spectrogram reading, but has been hard to exploit in
automatic speech analysis, because of unreliable formant detection at the often
highly non-stationary consonant-vowel transitions.

The analysis here is better behaved, capturing rapid formant ridges as well as
shallow ones at the transitions. As noted earlier, however, when the formants
approximate, a single ridge is produced. The F3 ridge is also sometimes lost near
the transition for this speaker; in these cases, F3 appears somewhat diffuse and
hard to locate in the spectrograms also. These issues, as well as how to locate the
burst, will present difficulties for automatic consonant detection.



Figure 5.10. Syllable initial /b/'s.

Figure 5.11. Syllable initial /d/'s.

Figure 5.12. Syllable initial /g/'s.

Figure 5.13. Rapid formant transitions. (a) /bi/ in the context /ubi/. (b) /du/
in the context /idu/.



5.5. Female speech

Higher pitched speech, such as female and children's speech, presents the problem
that the harmonics of the (voiced) excitation are fairly widely spaced, viz. a few
hundred Hertz or more. This means that in a quasi-stationary analysis the spectrum is
less frequently sampled than for lower pitched speech, resulting in poorer estimates
of the vocal tract transfer function (cf. Figure 3.2). Viewed two-dimensionally, the
situation is more symmetric. For example, as the frequency of an impulse train
is increased, the frequency spacing of the impulses in its time-frequency
autocorrelation function (Figure 3.3) will increase, but their time spacing will decrease.
Thus one will have poorer frequency 'sampling' of a time-varying transfer function
excited by this impulse train, but better time 'sampling'.
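
The trade-off can be made concrete with some simple arithmetic. For a periodic excitation with fundamental frequency f0, the harmonics fall at multiples of f0 and the pulses are spaced 1/f0 apart in time. The sketch below counts how many harmonics sample the spectrum up to a band limit and how many pulses fall within a fixed duration; the helper name, the two pitch values, the 4000 Hz band limit, and the 100 ms duration are illustrative assumptions, not values taken from the analysis.

```python
def sampling_density(f0_hz, band_hz=4000.0, duration_s=0.1):
    """Spacing and count of harmonic (frequency) and pulse (time) samples."""
    harmonic_spacing_hz = f0_hz              # harmonics lie at k * f0
    pulse_spacing_ms = 1000.0 / f0_hz        # pulses occur every 1 / f0 seconds
    n_harmonics = int(band_hz // f0_hz)      # harmonics at or below band_hz
    n_pulses = round(duration_s * f0_hz)     # pulses within duration_s
    return harmonic_spacing_hz, pulse_spacing_ms, n_harmonics, n_pulses

for f0 in (100.0, 200.0):   # assumed lower and higher pitch values
    spacing_hz, spacing_ms, n_harm, n_pulse = sampling_density(f0)
    print(f"f0 = {f0:.0f} Hz: harmonics every {spacing_hz:.0f} Hz ({n_harm} below 4 kHz), "
          f"pulses every {spacing_ms:.0f} ms ({n_pulse} in 100 ms)")
```

Doubling the pitch halves the number of frequency samples of the transfer function and doubles the number of time samples, which is the symmetry described above.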

The analysis presented in Chapter 3 exploits this fact by matching the time-frequency
window to the pitch. Higher pitched speech requires a window at a larger frequency
scale but at a lower time scale than lower pitched speech. The remaining analysis
proceeds as before. Figure 5.14 gives an example with rapid F2 motion. Figure
5.14a shows a wideband spectrogram of the nonsense utterance /uiuiui/ from an
adult female. Figure 5.14b shows the ridge analysis using a time-frequency window
matched to a 200 Hz pitch.

Note that the F1 ridge and the steep F2 ridge are well resolved. Where F2 and F3
approximate, however, only a single ridge is found. Such mergers in the analysis
are more common in higher pitched speech due to the greater frequency smoothing
required. However, since less time smoothing is required than for lower pitched
speech, transient effects should, in principle, be better resolved.



5.6. Transmission channel effects

Finally, we consider the effects of imperfect transmission channels on the analysis.
In particular, we will consider the effects of passing the speech signal through some
simple LTI filters. While the examples we give are idealized, natural environments
can give rise to many kinds of transmission channel characteristics. In general,
human listeners can tolerate a wide variety of alterations to a speech signal and
have it remain intelligible [see Licklider & Miller 1951 for a good review]. That is
not to say one is unaware of the modification; e.g., a pronounced room resonance
adds a 'hollow' quality to the speech, but it does not destroy its intelligibility.

Figure 5.15 shows the frequency response of the transmission channels we consider.
Figure 5.15a consists of a single pole at 1500 Hz of 750 Hz bandwidth. Figure
5.15b consists of a single pole at 1500 Hz of 150 Hz bandwidth, and Figure 5.15c
consists of a pole-zero pair, both at 1500 Hz; the pole has 1000 Hz bandwidth
while the zero has 150 Hz bandwidth. Thus, the first channel consists of a fairly
broadband, but non-uniform channel; the second channel emphasizes the signal
energy in the neighborhood of 1500 Hz; and the third channel removes signal energy
in the neighborhood of 1500 Hz.
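
As a rough check on these descriptions, the three channels can be modelled as simple digital filters. The sketch below is one plausible construction, not necessarily how the channels were implemented: it assumes a 10 kHz sampling rate (not stated here), interprets each 'pole' or 'zero' as a complex-conjugate pair, and uses the common approximation that a root of radius exp(-pi*B/fs) has bandwidth B. The helper names `resonance` and `gain_db` are hypothetical.

```python
import cmath
import math

FS = 10000.0  # assumed sampling rate in Hz (an assumption, not from the text)

def resonance(center_hz, bandwidth_hz, fs=FS):
    """Complex root for a resonance (or anti-resonance) with the given
    center frequency and approximate bandwidth."""
    radius = math.exp(-math.pi * bandwidth_hz / fs)
    return cmath.rect(radius, 2.0 * math.pi * center_hz / fs)

def gain_db(freq_hz, poles=(), zeros=(), fs=FS):
    """Magnitude response in dB; each root is paired with its conjugate."""
    zinv = cmath.exp(-2j * math.pi * freq_hz / fs)
    h = 1.0 + 0j
    for z0 in zeros:
        h *= (1 - z0 * zinv) * (1 - z0.conjugate() * zinv)
    for p in poles:
        h /= (1 - p * zinv) * (1 - p.conjugate() * zinv)
    return 20.0 * math.log10(abs(h))

channels = {
    "a (broadband)": dict(poles=[resonance(1500, 750)]),
    "b (narrowband)": dict(poles=[resonance(1500, 150)]),
    "c (notch)": dict(poles=[resonance(1500, 1000)],
                      zeros=[resonance(1500, 150)]),
}

# Gain at the 1500 Hz center relative to a reference frequency of 500 Hz.
relative = {
    name: gain_db(1500, **roots) - gain_db(500, **roots)
    for name, roots in channels.items()
}
for name, rel in relative.items():
    print(f"channel {name}: gain at 1500 Hz relative to 500 Hz = {rel:+.1f} dB")
```

Under these assumptions the broadband pole gives a mild boost at 1500 Hz, the narrowband pole a much stronger one, and the pole-zero pair a dip, in line with the qualitative description of the three channels.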

We show the effects of these transmission channels on the analysis of the utterance
fvnaif from Section 3.S. Figure 5.16a shows the wideband spectrogram of this
utterance passed through the first channel, and Figure 5.16b shows the corresponding
ridge analysis. The effect of this broadband channel is minor when compared to
the original analysis in Figure 4.10. Figure 5.17a shows the wideband spectrogram
of the utterance passed through the second channel, and Figure 5.17b shows the
corresponding ridge analysis. The effect of this narrowband channel is to add an
additional ridge at 1500 Hz. Finally, Figure 5.18a shows the wideband spectrogram
of the utterance passed through the third channel, and Figure 5.18b shows the
corresponding ridge analysis. The effect of this narrowband 'notch' is to put an energy
trough in the time-frequency surface, with the F2 ridge being partially cancelled
in the vicinity of this notch. Compare this analysis with the LPC analysis of this
filtered utterance shown in Figure 5.18c (using the same analysis parameters as in
Figure 4.1). We see there that the notch filter plays havoc with the LPC analysis,
since the zero lies outside the scope of its all-pole model. This is analogous to the
effects of nasalization on LPC analysis.



Figure 5.14. /uiuiui/ uttered by an adult female. (a) Wideband spectrogram. (b)
Ridge analysis.

Figure 5.15. Transmission channels. (a) 750 Hz bandwidth pole at 1500 Hz. (b)
150 Hz bandwidth pole at 1500 Hz. (c) Pole-zero pair at 1500 Hz of 1000 Hz and
150 Hz bandwidth, respectively.

Figure 5.16. Utterance passed through the transmission channel in Figure 5.15a
(broadband filter). (a) Wideband spectrogram. (b) Ridge analysis.

Figure 5.17. Utterance passed through the transmission channel in Figure 5.15b
(narrowband filter). (a) Wideband spectrogram. (b) Ridge analysis.

Figure 5.18. Utterance passed through the transmission channel in Figure 5.15c
(notch filter). (a) Wideband spectrogram. (b) Ridge analysis. (c) LPC analysis.



References



Atal, B. & Hanauer, S. 1971. Speech analysis and synthesis by linear prediction of
the speech wave. J. Acoust. Soc. Am. 50, 637-655.

Bell, C., Fujisaki, H., Heinz, J., Stevens, K., & House, A. 1961. Reduction of
speech spectra by analysis-by-synthesis techniques. J. Acoust. Soc. Am. 33, 1725-

Beranek, L. 1954. Acoustics. New York: McGraw-Hill.

Bouachache, B., Escudie, B., Flandrin, P., & Grea, J. 1979. Sur une condition
necessaire et suffisante de positivite de la representation conjointe en temps et frequence
des signaux d'energie finie. Comptes Rendus Acad. Sciences, 288, Serie A, 307-309.

Bracewell, R. 1978. The Fourier Transform and its Applications. New York:
McGraw-Hill.

Britt, R. & Starr, A. 1976. Synaptic events and discharge patterns of cochlear
nucleus cells. II. Frequency-modulated tones. J. Neurophysiol. 39, 152-178.

Chiba, T. & Kajiyama, M. 1941. The Vowel, its Nature and Structure. Tokyo:
Tokyo-Kaiseikan.

Claasen, T. & Mecklenbrauker, W. 1980a. The Wigner distribution: a tool for
time-frequency signal analysis. Part I. Philips J. Res. 35, 217-250.

Claasen, T. & Mecklenbrauker, W. 1980b. The Wigner distribution: a tool for
time-frequency signal analysis. Part II. Philips J. Res. 35, 267-300.

Claasen, T. & Mecklenbrauker, W. 1980c. The Wigner distribution: a tool for
time-frequency signal analysis. Part III. Philips J. Res. 35, 372-389.

Claasen, T. & Mecklenbrauker, W. 1984. On the time-frequency discrimination of
energy distributions: can they look sharper than Heisenberg? Proc. ICASSP '84, 3.
41B.4.1-4.

Cohen, L. 1966. Generalized phase-space distribution functions. J. Math. Phys.
7, 781-786.



De Bruijn, J. 1967. Uncertainty principles in Fourier analysis. Inequalities. Shisha,
O. (Ed.), New York: Academic Press. 57-71.

Do Carmo, M. 1976. Differential Geometry of Curves and Surfaces. Englewood
Cliffs, NJ: Prentice-Hall.

Dudgeon, D. 1984. Detection of narrowband signals with rapid changes in center
frequency. ICASSP DSP Workshop. October 8-10. Chatham, MA.

Fant, G. 1960. Acoustic Theory of Speech Production. Hague: Mouton.

Fant, G. 1980. The relations between area functions and the acoustic signal.
Phonetica, 37, 55-86.

Flanagan, J. 1956. Automatic extraction of formant frequencies from continuous
speech. J. Acoust. Soc. Am. 28, 110-118.

Flanagan, J. 1972. Speech Analysis, Synthesis and Perception. 2nd ed. New York:
Springer-Verlag.

Flandrin, P. 1984. Some features of time-frequency representations of
multicomponent signals. Proc. ICASSP '84, 3. 41B.4.1-4.

Greenewalt, C. 1968. Bird Song: Acoustics & Physiology. Washington, DC:
Smithsonian Institution Press.

Hille, E. 1948. Functional Analysis and Semi-groups. New York: Amer. Math. Soc.

Hlawatsch, F. 1984. Theory of bilinear time-frequency signal representations. ICASSP
DSP Workshop. October 8-10. Chatham, MA.

Janssen, A. 1983. On the locus and spread of pseudo-density functions in the
time-frequency plane. Philips J. Res. 37, 79-110.

Jospa, P. 1982. Theory of evolving normal modes and the vocal tract. Part I: Basic
relations and adiabatic invariance. Acustica. 53, 86-94.

Jospa, P. 1984. Theory of evolving normal modes and the vocal tract. Part II:
Evolving frequency and amplitude. Acustica. 57, 133-135.



Kay, R. & Matthews, D. 1972. On the existence in human auditory pathways of
channels selectively tuned to the modulation present in frequency-modulated tones.
J. Physiol. 225, 657-677.

Kuhn, G. 1975. On the front cavity resonance and its possible role in speech
perception. J. Acoust. Soc. Am. 58, 428-455.

Ladefoged, P. 1975. A Course in Phonetics. New York: Harcourt, Brace, and
Jovanovich.

Liberman, A., Cooper, F., Shankweiler, D., & Studdert-Kennedy, M. 1967. Perception
of the speech code. American Journal of Psychology, 65, 487-510.

Liberman, A., Delattre, P., & Gerstman, L. 1954. The role of consonant-vowel
transitions in the perception of the stop and nasal consonants. Psychological Monographs.
68, 1-13.

Licklider, J. & Miller, G. 1951. The perception of speech. Handbook of Experimental
Psychology. Stevens, S. (Ed.) New York: Wiley.

Liporace, L. 1975. Linear estimation of non-stationary signals. J. Acoust. Soc. Am.
58, 1288-1295.

Liu, S. 1971. Time-varying spectra and linear transformation. Bell Sys. Tech. J. 50,
2355-2374.

Loynes, M. 1968. On the concept of the spectrum for nonstationary processes. J.
Roy. Statist. Soc. (B) 30, 1-20.

Markel, J. & Gray, A. 1976. Linear Prediction of Speech. New York: Springer-Verlag.

Marler, P. 1977. The structure of animal communication sounds. Recognition of
Complex Acoustic Signals. Bullock, T. (Ed.) West Berlin: Dahlem Konferenzen.

Marr, D. 1982. Vision. San Francisco: Freeman.

Marr, D. & Hildreth, E. 1980. Theory of edge detection. Proc. R. Soc. Lond. (B)
207, 187-217.



Marr, D. & Nishihara, H. 1978. Representation and recognition of the spatial
organization of three-dimensional shapes. Proc. R. Soc. Lond. (B) 200, 269-294.

Martin, W. & Flandrin, P. 1985. Wigner-Ville spectral analysis of nonstationary
processes. IEEE Trans. ASSP ASSP-33, 1461-1470.

Mathews, M., Miller, J. & David, E. 1961. Pitch synchronous analysis of voiced
sounds. J. Acoust. Soc. Am. 45, 458-465.

McCandless, S. 1974. An algorithm for automatic formant extraction using linear
prediction spectra. IEEE Trans. ASSP ASSP-22:2, 135-174.

Moller, A. 1978. Coding of time-varying sounds in the cochlear nucleus. Audiology.
17, 446-468.

Morse, P. & Ingard, K. 1968. Theoretical Acoustics. New York: McGraw-Hill.

Neuweiler, G. 1977. Recognition mechanisms in echolocation in bats. Recognition
of Complex Acoustic Signals. Bullock, T. (Ed.) West Berlin: Dahlem Konferenzen.

Olive, J. 1971. Automatic formant tracking in a Newton-Raphson technique. J.
Acoust. Soc. Am. 50, 661-670.

Oppenheim, A. 1969. A speech analysis-synthesis system based on homomorphic
filtering. J. Acoust. Soc. Am. 45, 458-465.

Oppenheim, A. & Schafer, R. 1975. Digital Signal Processing. Englewood Cliffs, NJ:
Prentice-Hall.

Page, C. 1952. Instantaneous power spectra. J. Applied Phys. 23, 103-106.

Priestley, M. 1965. Evolutionary spectra and non-stationary processes. J. Roy.
Statist. Soc. (B) 27, 204-229.

Rabiner, L. & Schafer, R. 1978. Digital Processing of Speech Signals. Englewood
Cliffs, NJ: Prentice-Hall.

Regan, D. & Tansley, B. 1979. Selective adaptation to frequency-modulated tones:
Evidence for an information-processing channel selectively sensitive to frequency
changes. J. Acoust. Soc. Am. 65, 1249-1257.



Riley, M. 1983. Schematizing spectrograms for speech recognition. BTL Technical
Memo. 11225-83-0924-07.

Riley, M. 1984. Detecting time-varying spectral energy concentrations. ICASSP
DSP Workshop. October 8-10. Chatham, MA.

Rudin, W. 1973. Functional Analysis. New York: McGraw-Hill.

Saleh, B. & Subotic, N. 1985. Time-variant filtering of signals in the mixed
time-frequency domain. IEEE Trans. Acoust., Speech, Signal Processing ASSP-33,
1479-1485.

Schafer, R. & Rabiner, L. 1970. System for automatic formant analysis of voiced
speech. J. Acoust. Soc. Am. 47, 634-648.

Siebert, W. 1956. A radar detection philosophy. IRE Trans. Inform. Theory. PGIT,
6, 204-221.

Stevens, K. & House, A. 1955. Development of a quantitative description of vowel
articulation. J. Acoust. Soc. Am. 27, 484-493.

Van Trees, H. 1968. Detection, Estimation, and Modulation Theory Pt. 1. New
York: Wiley.

Van Trees, H. 1971. Detection, Estimation, and Modulation Theory Pt. 3. New
York: Wiley.

Waltz, D. 1975. Understanding line drawings of scenes with shadows. Psychology
of Computer Vision. Winston, P. (Ed.), New York: McGraw-Hill.

Zadeh, L. 1950. Frequency analysis of variable networks. Proc. IRE. 38, 291.



